DOCUMENT RESUME

ED 137 371                    95                    TM 006 172

AUTHOR          Borich, Gary; And Others
TITLE           Classroom Observation Data: Is It Valid? Is It
                Generalizable? A Compendium of Methodological
                Papers.
INSTITUTION     Texas Univ., Austin. Research and Development Center
                for Teacher Education.
SPONS AGENCY    National Inst. of Education (DHEW), Washington, D.C.
PUB DATE        [Apr 77]
CONTRACT        NIE-C-74-0088
NOTE            92p.; Papers presented at the Annual Meeting of the
                American Educational Research Association (61st, New
                York, New York, April 4-8, 1977)

EDRS PRICE      MF-$0.83 HC-$4.67 Plus Postage.
DESCRIPTORS     *Academic Achievement; *Classroom Observation
                Techniques; Classroom Research; Effective Teaching;
                Elementary Education; Junior High Schools;
                Mathematical Models; Matrices; Reading Achievement;
                Reading Instruction; *Reliability; Student Behavior;
                *Student Teacher Relationship; *Teacher Behavior;
                *Validity; Video Tape Recordings
IDENTIFIERS     CERLI Verbal Behavior Classification System;
                Classroom Communication Observation System; Flanders
                Interaction Analysis; Generalizability Theory;
                Observation Schedule and Record; Spaulding Teacher
                Activity Rating Schedule; Teacher Child Dyadic
                Interaction System (Brophy)

ABSTRACT

The issues discussed in these four papers concern the
validity and generalizability of classroom observation instruments.
These issues have been studied and are reported here in an attempt to
better define the limits to which classroom observation instruments
can be used in researching relationships between teacher behavior and
student outcome. The premise undergirding these investigations is
that before consistent and positive process-product relationships can
be found, investigators must be cognizant of the sources of variance
which affect the validity and generalizability of their process
measures and which, in turn, affect the credibility of their research
findings. The four papers are: "Convergent and Discriminant Validity
of Five Classroom Observation Systems: Testing the Model" by G.
Borich, D. Malitz, C. L. Kugle, and M. Pascone; "Generalizability of
Teacher Behaviors Across Classroom Observation Systems" by D.
Calkins, G. Borich, M. Pascone, and C. L. Kugle; "Measuring Classroom
Interactions: How Many Occasions Are Required to Measure Them
Reliably?" and "Generalizability of Teacher Process Behaviors During
Reading Instruction," both by O. Erlich and G. Borich. (BC)







CLASSROOM OBSERVATION DATA:
Is it valid? Is it generalizable?



A Compendium of Methodological Papers



Gary Borich
Dick Calkins
David Malitz
Oded Erlich
Cherry Kugle
Maria Pascone



Evaluation of Teaching Project
The Research and Development Center for Teacher Education
The University of Texas at Austin
Austin, Texas






Papers presented at the annual meeting of the American Educational
Research Association, New York City, April 4-8, 1977.

This research was supported in part by National Institute of Education Contract
NIE-C-74-0088, Research and Development Center for Teacher Education. The
opinions expressed herein do not necessarily reflect the position or policy of
the National Institute of Education and no official endorsement by that
office should be inferred.



Preface 



The following papers represent modest attempts to bring clarity to a
complex problem. The issues discussed in the following four papers concern
the validity and generalizability of classroom observation instruments.
These issues have been studied and are reported here in an attempt to better
define the limits to which classroom observation instruments can be used in
researching relationships between teacher behavior and student outcome. The
premise undergirding these investigations is that before consistent and
positive process-product relationships can be found, investigators must be
cognizant of the sources of variance which affect the validity and
generalizability of their process measures and which, in turn, affect the
credibility of their research findings.



GDB 






TABLE OF CONTENTS 



G. D. Borich, D. Malitz, C. L. Kugle, and M. Pascone

"Convergent and Discriminant Validity of Five Classroom
Observation Systems: Testing the Model"



D. Calkins, G. D. Borich, M. Pascone, and C. L. Kugle

"Generalizability of Teacher Behaviors Across Classroom
Observation Systems"



O. Erlich and G. D. Borich

"Measuring Classroom Interactions: How Many Occasions
Are Required to Measure Them Reliably?"



O. Erlich and G. D. Borich

"Generalizability of Teacher Process Behaviors During
Reading Instruction"



Convergent and Discriminant Validity of Five Classroom
Observation Systems: Testing the Model

Gary D. Borich, David Malitz,
C. L. Kugle, & Maria Pascone



The University of Texas at Austin 

Numerous instruments have been developed to systematically observe classroom
behavior. These instruments typically consist of a number of categories of
teacher-student behavior which an observer tallies or rates periodically as he
watches classroom interaction. For the greater part of a decade researchers have
used such instruments to investigate the relationship between teacher behavior and
student outcome, but this effort has yielded relatively few consistent findings.
While many possible reasons for the dearth of consistent findings can be advanced,
two which must be considered are that the research model or theory implicit in
process-product investigations may be inadequate or too simplistic to uncover such
relationships, and that psychometric weaknesses within instruments used by the
researchers may obscure any underlying relationships which do exist.

At present, there is no a priori reason to suspect one of these possibilities
over any other. However, as process-product studies themselves confirm (Brophy &
Evertson, 1976; Good & Grouws, 1975; McDonald et al., 1975; Stallings & Kaskowitz,
1974), there has been a conspicuous lack of validity studies of the research
instruments used, especially instruments to measure teacher behavior. Taking note
of this, Borich (1977) detailed several of the most salient sources of invalidity
afflicting observational measures of teacher behavior, but did not provide empirical
data as to the actual effect of these sources of invalidity on instruments used
to measure teacher behavior. The present study undertook to determine the extent
to which one of these sources of invalidity, the lack of convergent and discriminant
validity, was present in five classroom observation systems. The validity model
reported by Campbell and Fiske (1959) was employed; this model requires that both
convergent and discriminant validity be demonstrated.
Convergent validity is a confirmation of traits (or variables or categories)
by independent measuring methods that requires significant correlation between two
methods (or systems) measuring the same trait. Discriminant validity is a
requirement that "the correlation between different measures measuring the same trait
exceed (a) the correlations obtained between that trait and any other trait not
having method in common and (b) the correlations between different traits which
happen to employ the same method" (Borich & Malitz, 1975). By determining
intercorrelations among categories in a multitrait-multimethod matrix, one can
identify categories which pass specified tests of convergent and discriminant
validity. These procedures were applied to the following data in order to
ascertain the external validity of five classroom observation systems.

Method 

The data were obtained from videotapes of twelve in-service junior high
school teachers, each teaching the same content, a unit in social studies, on three
occasions of approximately 50 minutes duration. Each of 36 videotapes was rated
by five pairs of coders, each pair trained in a different observational coding
system. For two of the five observational systems, these being the two most
complex systems, coders were employed who had previously been trained by the
authors of these systems. The remaining three pairs of coders were trained by the
investigators from training materials supplied by system authors and from standard
protocols from system manuals.

The five systems employed for this study were selected from Simon and Boyer's
Mirrors for Behavior (1970). The systems were (1) the Observation Schedule and
Record, OScAR 5 (Medley & Mitzel, 1959), (2) Spaulding Teacher Activity Rating
Schedule, STARS (Spaulding, 1967), (3) Flanders System of Interaction Analysis
(Flanders, 1971), (4) CERLI Verbal-Behavior Classification System, CVC (Cooperative
Educational Research Laboratory) (Note 1), and (5) the Classroom Communication
Observational System, CCO (Withal, Lewis & Newell, 1951). These systems were
selected because of their availability to the educational research community (and
therefore presumed use) and for the number of categories and associated operational
definitions they had in common, the latter being a requirement for assessing
convergent validity.

Upon completion of training, system coders, using their respective systems,
rated three trial videotapes of the same general form and content as the experimental
tapes in order to obtain estimates of interjudge reliability prior to the study.
While reliabilities varied due to system complexity, all were deemed acceptable
and are reported in Table 1 along with the median reliability for each pair of
coders over all 36 tapes.

Insert Table 1 about here 



Descriptions of the behavior categories of the five systems were obtained
from the coding manuals, and categories were grouped across systems if from the
category descriptions it appeared that they measured the same behavior. From
these comparisons, two categories were paired across the Flanders and CCO systems,
three categories were paired across the Flanders and OScAR systems, three categories
were paired across the STARS and CVC systems, four categories were paired across
the OScAR and CCO systems, two categories were paired across the STARS and OScAR
systems, and six categories were paired across the CVC and OScAR systems, for a
total of six two-system comparisons. In addition, there was one three-system
comparison: two categories were compared across Flanders, CCO, and OScAR. A
description of the behaviors comprising these comparisons appears in
Appendix A.

In certain cases, a single variable from one system was paired with several
variables in another system. This procedure was most commonly employed when a
subset of categories on one system was encompassed by a single general category on
another and when members of the subset were coded independently of each other, i.e.,
were discrete behavioral categories. For example, in the Flanders vs. OScAR comparison,
the total Flanders frequency in category 9, Student Talk-Initiation, was correlated
with the sum of the OScAR frequencies in categories 10, Pupil Nonsubstantive
Utterance; 20, Pupil Question; 30, Pupil Statement; and 40, Pupil Response.
Matches were not made, however, for which the meaning of a category on one system
would have to be split among different categories on the other system, i.e., a
category could not be applicable to more than a single category or homogeneous
subset of categories on another system.
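
The one-to-many pairing described above can be sketched in a few lines. This is only an illustration, not the investigators' procedure as implemented: the frequencies are hypothetical values for four tapes (the study used 36), and the names pearson_r and summed_frequency are introduced here for the example.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def summed_frequency(tape_counts, categories):
    """Sum one tape's frequencies over a subset of category codes."""
    return sum(tape_counts[code] for code in categories)


# Hypothetical per-tape frequencies; not data from the study.
flanders_cat9 = [12, 30, 7, 21]          # Flanders category 9, one value per tape
oscar_tapes = [                          # OScAR categories 10, 20, 30, 40 per tape
    {10: 2, 20: 3, 30: 4, 40: 5},
    {10: 6, 20: 8, 30: 9, 40: 10},
    {10: 1, 20: 2, 30: 1, 40: 2},
    {10: 4, 20: 5, 30: 6, 40: 8},
]

# Collapse the OScAR subset into a single score per tape, then correlate.
oscar_pupil_talk = [summed_frequency(t, (10, 20, 30, 40)) for t in oscar_tapes]
validity_value = pearson_r(flanders_cat9, oscar_pupil_talk)
```

Summing before correlating treats the subset as one composite category, which is why the text requires the subset members to be coded independently of one another.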

Once the categories to be investigated had been identified, Pearson
product-moment correlations were computed. These correlations were used to construct
seven multitrait-multimethod matrices. For each matrix, a heterotrait-heteromethod
block was formed with those values in which categories may or may not coincide but
systems differ. A heterotrait-heteromethod block is illustrated in Figure 1.



Insert Figure 1 about here 



For each matrix, a diagonal (called the validity diagonal) is formed through
the heterotrait-heteromethod block by the series of cells in which categories
coincide but systems differ. Values in the validity diagonal which are significantly
different from zero are evidence for convergent validity. Discriminant validity
must be assessed in two steps. First, each validity value must be compared with
all values in its row and column in the heterotrait-heteromethod block to determine
whether the correlation between different methods of measuring the same category
exceeds correlations between that category and other categories not having method
in common. Second, the heterotrait-monomethod triangles are examined to determine
whether the correlation between different methods of measuring the same category
exceeds correlations between that category and other categories which have method
in common. This step is completed by comparing each category's validity diagonal
value with values in the heterotrait-monomethod triangles in which that category
is involved. This two-step procedure was carried out for each validity diagonal
value in each of the seven matrices, and the results entered in Tables 2-8. In
Figure 1, the validity diagonal for category "A" is significant at the .05 level
and, therefore, it can be taken as evidence for convergent validity. Also, category
"A" presents good evidence for discriminant validity, since its validity diagonal
value exceeds all of the values specified in the two-step procedure outlined above.
Category "B", on the other hand, indicates neither convergent nor discriminant
validity.
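
The convergent test and the two discriminant steps can be written out as a short routine. What follows is a minimal sketch under stated assumptions, not the investigators' program: the function name evaluate_category, the dictionary layout, and every correlation in the example matrix are hypothetical, chosen only so that category "A" passes and category "B" fails in the manner described for Figure 1. The sketch also assumes that a validity value must be positive and above the critical value to count as convergent evidence.

```python
def evaluate_category(corr, methods, traits, trait, r_crit):
    """Apply the convergent test and both discriminant steps to one trait.

    corr maps frozenset({(method, trait), (method, trait)}) to a correlation.
    methods is a pair of method (system) names; traits lists all categories.
    """
    m1, m2 = methods
    others = [t for t in traits if t != trait]

    def r(a, b):
        return corr[frozenset((a, b))]

    # Validity diagonal: same trait, different methods.
    validity = r((m1, trait), (m2, trait))
    # Step 1: row and column of the heterotrait-heteromethod block.
    hetero_method = [r((m1, trait), (m2, o)) for o in others]
    hetero_method += [r((m1, o), (m2, trait)) for o in others]
    # Step 2: heterotrait-monomethod triangles involving this trait.
    mono_method = [r((m, trait), (m, o)) for m in methods for o in others]

    return {
        "validity": validity,
        "convergent": validity > r_crit,  # assumption: positive and significant
        "step1": all(validity > v for v in hetero_method),
        "step2": all(validity > v for v in mono_method),
    }


# Hypothetical two-trait, two-method matrix (methods X and Y, traits A and B).
example = {
    frozenset({("X", "A"), ("X", "B")}): 0.20,  # monomethod triangle, X
    frozenset({("Y", "A"), ("Y", "B")}): 0.25,  # monomethod triangle, Y
    frozenset({("X", "A"), ("Y", "A")}): 0.70,  # validity diagonal, A
    frozenset({("X", "B"), ("Y", "B")}): 0.10,  # validity diagonal, B
    frozenset({("X", "A"), ("Y", "B")}): 0.15,  # heterotrait-heteromethod
    frozenset({("X", "B"), ("Y", "A")}): 0.12,  # heterotrait-heteromethod
}

result_a = evaluate_category(example, ("X", "Y"), ["A", "B"], "A", r_crit=0.325)
result_b = evaluate_category(example, ("X", "Y"), ["A", "B"], "B", r_crit=0.325)
```

With these illustrative values, result_a reports True for all three checks and result_b reports False for all three. Comparisons here use raw rather than absolute correlations, which is a design choice of the sketch.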



Results 



Seven matrices resulted from the process of comparing categories and groups of
categories across the five systems. Six of these matrices compared categories
across two systems. The seventh matrix involved three of the systems. No matching
categories were found to exist across any four or all five of the systems. A
category or group of categories which was found to match across two or more systems
will be referred to as a comparison category (CC). Twenty-three such CC's were
created and will be referred to by number (i.e., CC1 through CC23). Appendix A
lists each of these 23 CC's and the constituent system categories which comprise
each CC. Of the six two-system matrices, five contained four or fewer CC's. The
multitrait-multimethod (MTMM) matrices for these five two-system comparisons
are shown in Tables 2 through 6. The three-system matrix is presented in Table 7.
One two-system matrix contained six CC's and is shown in Table 8. Since this
matrix is somewhat cumbersome to evaluate in its raw form, a summary table, Table 9,
was constructed to aid in its evaluation.

Table 2 shows the matrix resulting from the matching of two categories across
the Flanders and CCO systems. It can be noted that both CC1 and CC2 pass the
criterion for convergent validity since both CC's have significant validity
diagonal values (.7699 and .6620 respectively; critical r = .325, df = 35). Since both
CC's pass the test for convergent validity, they may be examined for discriminant
validity. It will be recalled that determining discriminant validity is a two-step
process. The first step involves comparisons of each CC's validity diagonal
value with the other values in its row and column in the heterotrait-heteromethod
block. The second step requires comparison of the validity diagonal value for
each CC with values in the heterotrait-monomethod triangles. For both CC1 and
CC2, the validity value exceeds the heterotrait-heteromethod values. Thus, both
CC's meet the first criterion for discriminant validity, since in both cases the
correlation between different methods of measuring the same behavior exceeds
correlations between that category and other categories not having method in
common. In addition, the validity value exceeds the heterotrait-monomethod values.
In other words, the correlation between different methods of measuring the same
behavior exceeds correlations between that category and other categories having
method in common, which is the second step. In summary, CC1 and CC2 pass all tests
for convergent and discriminant validity.
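
The critical value quoted for Table 2 (.325 with df = 35) is consistent with the usual conversion from a critical t to a critical r. The worked step below is a sketch that assumes a two-tailed test at the .05 level and a tabled value of t of about 2.030 for 35 degrees of freedom; the text does not state either assumption for this table.

```latex
r_{\text{crit}}
  = \frac{t_{\text{crit}}}{\sqrt{t_{\text{crit}}^{2} + df}}
  = \frac{2.030}{\sqrt{2.030^{2} + 35}}
  = \frac{2.030}{\sqrt{39.12}}
  \approx .325
```

Both validity diagonal values in Table 2 (.7699 and .6620) lie well above this threshold.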



Insert Tables 2 & 3 about here 



Table 3 contains the results of matching categories across the Flanders
and OScAR systems. Three CC's resulted from this comparison: CC3, CC4, and CC5.
Examination of the validity diagonal reveals evidence for convergent validity for
CC3 and CC4 since their values are significant. CC5 has a nonsignificant validity
value, and therefore need not be examined for discriminant validity. CC3 and CC4
pass both the first and second steps for discriminant validity, since their
validity diagonal values exceed all relevant values in both the
heterotrait-heteromethod block and in the heterotrait-monomethod triangles. Thus,
CC3 and CC4 pass all tests for convergent and discriminant validity. CC5 lacks
evidence for convergent validity and therefore its discriminant validity need not be
examined.

Table 4 shows the three CC's (CC6, CC7, and CC8) resulting from comparison
of the CVC and STARS systems. None of these three CC's have significant validity
values and therefore lack evidence of convergent validity. Discriminant validity is
also necessarily lacking and therefore need not be examined.



Insert Table 4 about here


Table 5 indicates that four CC's (CC15, CC16, CC17, and CC19) resulted from
comparison of the CCO and OScAR systems.



Insert Table 5 about here



Examination of the validity diagonal values indicates that only CC15 and CC19 have
significant values. However, both CC15 and CC19 fail the first step in the
assessment of discriminant validity since both are exceeded by the
heterotrait-heteromethod value of .5054. While the validity values for these
categories pass the second test by exceeding all values in the heterotrait-monomethod
triangles, they do not pass the first test for discriminant validity. Thus, while
CC15 and CC19 show evidence for convergent validity, they show mixed results for
discriminant validity.

Table 6 shows the results of the comparison of STARS with OScAR. Two CC's
(CC20 and CC21) resulted from this comparison and both pass all tests for
convergent and discriminant validity.



Insert Table 6 about here 



When the Flanders, CCO, and OScAR systems were compared, two CC's
were found. The results of the comparison of CC22 and CC23 are presented in
Table 7. Analysis of a three-system matrix proceeds in exactly the same manner
as the analysis of a two-system matrix, except that instead of one validity diagonal
to examine, there are now three (corresponding to the three system pairings).
Examination of the three validity diagonals indicates all values are significant.
Furthermore, it can be noted that each of these values exceeds the relevant
values in the heterotrait-heteromethod blocks and in the heterotrait-monomethod
blocks. Thus, both CC22 and CC23 pass all tests for convergent and discriminant
validity in the Flanders, CCO, and OScAR comparison.



Insert Table 7 about here
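
The count of three validity diagonals follows from taking the systems two at a time. The short sketch below enumerates the pairings and the validity-diagonal cells for the two comparison categories; the variable names are illustrative only, and no correlations are involved.

```python
from itertools import combinations

systems = ["Flanders", "CCO", "OScAR"]
comparison_categories = ["CC22", "CC23"]

# Each unordered pair of systems contributes one validity diagonal.
system_pairs = list(combinations(systems, 2))

# Cells on the validity diagonals: same category, different systems.
validity_cells = [
    ((first, cc), (second, cc))
    for first, second in system_pairs
    for cc in comparison_categories
]
```

Three pairings with two categories each give six validity values to test, each of which is then subjected to the same two discriminant steps as in a two-system matrix.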

Lastly, comparison of the CVC and OScAR systems resulted in the creation of
six CC's (CC9 through CC14). The correlation matrix for these comparisons is
shown in Table 8 while a summary table of these data is presented in Table 9.
Table 9 shows the validity diagonal value for each CC. In addition, data are
presented pertaining to each CC's discriminant validity (the highest value in the
relevant parts of the heteromethod and monomethod blocks and the number of times
the validity value is exceeded in each of these blocks). Examination of this
table reveals that three CC's (CC10, CC12, and CC14) have nonsignificant validity
values and therefore lack evidence for convergent validity. Of the remaining
three CC's, all show good evidence of discriminant validity since their validity
values are exceeded by none of the relevant heteromethod or monomethod values.



Insert Tables 8 & 9 about here

Comparison of the five teacher observation systems employed in this study
produced 23 CC's. Twenty-one of these CC's were involved in two-system
comparisons and two in a three-system comparison. Of the 23 CC's, 13 (57%) showed
evidence of convergent validity. Eleven of the 23 CC's (48%) passed tests for
both convergent and discriminant validity. Thus, of the CC's which were analyzed, only
about half conformed to Campbell and Fiske's criteria for convergent and
discriminant validity.
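
The summary percentages can be rechecked from the outcomes reported for the individual matrices. The tally below is a sketch: the CC numbers are transcribed from the Results text above, the total of 23 is taken as reported, and the variable names are introduced only for this example.

```python
# CC numbers reported above as having significant validity diagonal values.
convergent = {1, 2, 3, 4, 9, 11, 13, 15, 19, 20, 21, 22, 23}
# CCs reported as convergent but failing the first discriminant step.
failed_discriminant = {15, 19}
TOTAL_CCS = 23  # total number of comparison categories, as reported

passed_both = convergent - failed_discriminant


def percent(count, total):
    """Whole-number percentage of count out of total."""
    return round(100 * count / total)


not_convergent = TOTAL_CCS - len(convergent)
summary = {
    "convergent": (len(convergent), percent(len(convergent), TOTAL_CCS)),
    "both": (len(passed_both), percent(len(passed_both), TOTAL_CCS)),
    "not_convergent": (not_convergent, percent(not_convergent, TOTAL_CCS)),
}
```

The tally gives 13 of 23 (57%) for convergent validity, 11 of 23 (48%) for both tests, and 10 of 23 (43%) lacking convergent validity.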



Discussion 



The purpose of this research has been to evaluate the convergent and discriminant
validity of five classroom interaction systems which either have been used in studies
relating teacher behavior to pupil outcome or are reasonable representations of the
types of systems which have been used in this research. It was the investigators'
belief that at least one explanation for the large number of inconsistent and "null"
findings in process-product studies was that the instrumentation used to measure
classroom behavior, particularly teacher process behavior, may not exhibit convergent
and discriminant validity. The findings of this study support this conviction, since
about half of the teacher process behaviors investigated failed to pass tests for
convergent and discriminant validity. While no reference to specific
process-product studies need be made, the investigators suggest that many such studies
have measured behaviors with similar forms of instrumentation and some studies
have utilized the same instrumentation as was studied in this investigation.
Based upon the results of studying five classroom observation systems, the
implications are not particularly encouraging for researchers who choose to measure
classroom interaction. One can infer that of the hundreds of other observational
coding instruments which have been developed, many must contain categories which
do not meet the standards of convergent and discriminant validity proposed in this
study. Process-product researchers as well as those who attempt to aggregate and
accumulate the findings of process-product research might well be advised to
exercise caution in drawing conclusions from studies which use classroom observation
systems for which the measurement technique itself accounts for greater variation
than the behavior being measured (lack discriminant validity) or that incorporate
behaviors which when measured by different systems fail to correlate (lack
convergent validity).
As a result of using the MTMM technique to evaluate validity, two types
of instrument flaws became apparent. The first concerns the redundancy or overlap of
behavioral measures within systems reducing a construct's chances of exhibiting
discriminant validity. While complete independence of the behaviors measured within a
system is not expected, significant interrelationships among behavioral categories
substantially reduce the chances of these categories passing tests for discriminant
validity. In several instances in this study, interrelationships among behaviors
precluded any chance of a category exhibiting discriminant validity. For example,
in the first heterotrait-monomethod triangle in Table 4, CC6 and CC7 correlated .7470,
the highest correlation in the matrix. Note that in this instance even if the
validity diagonal values had been beyond significance (r = .325), they probably
would not have surpassed the heterotrait-monomethod value and thus the category's
discriminant validity would still have been rated "poor." When pilot testing
classroom observation instruments, authors might delete highly redundant categories
or attempt to reduce the significant interrelationships among such categories by
providing more specific operational definitions in order to increase the discriminant
validity of their instrument.
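
A pilot-testing screen of the kind suggested above might look like the following sketch. Only the .7470 correlation between CC6 and CC7 comes from the text; the other two correlations are hypothetical, the function name redundant_pairs is introduced here, and using the critical value of .325 as the screening threshold is an assumption made for illustration rather than a rule proposed by the authors.

```python
def redundant_pairs(monomethod_corr, threshold):
    """Return within-system category pairs correlating at or above threshold,
    ordered from most to least redundant."""
    flagged = [pair for pair, r in monomethod_corr.items() if r >= threshold]
    return sorted(flagged, key=lambda pair: -monomethod_corr[pair])


# Heterotrait-monomethod correlations within one system.
within_system = {
    ("CC6", "CC7"): 0.7470,  # reported in the text for Table 4
    ("CC6", "CC8"): 0.10,    # hypothetical
    ("CC7", "CC8"): 0.05,    # hypothetical
}

flagged = redundant_pairs(within_system, threshold=0.325)

# A flagged pair sets a floor that each category's validity value must beat.
floor_for_cc6 = max(r for pair, r in within_system.items() if "CC6" in pair)
```

Here only the CC6 and CC7 pair is flagged, and either category would need a validity diagonal value above .7470 to pass the second discriminant step.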

The second instrument flaw which came to light with the technique was the
relatively large number (43%) of teacher behaviors which failed to correlate
significantly with behaviors on other instruments with which they were matched, i.e.,
lacked convergent validity. While some of these numbers might be accounted for by
the inexactness of the matching process inherent in applying the MTMM technique to
classroom observation instruments, in general, the matches that were made in this
study may be considered conservative and were often supported by the same or similar
operational definitions. Thus, some of the seemingly similar constructs of the type
reviewers of process-product studies relate across studies when aggregating
process-product findings were, in fact, found not to be similar in this study. One might
account for this finding by method variance which confounded the measurement of
approximately half the behaviors in this study, vague operational definitions of
behaviors when actually interpreted by coders, and intrinsic coder differences.
This lack of convergent validity suggests that the descriptive titles of categories
and behavioral constructs employed in [illegible] systems may not
adequately represent the behavior they purport to measure. Since this flaw is
a between-system problem, [illegible]
operationalizations of their constructs when constructing new systems.

Evaluating convergent and discriminant validity with the multitrait-multimethod
procedure is one approach to assessing the validity of an instrument. The purpose
of the remaining portion of this discussion will be to outline both the practical
problems encountered in its use in this study and the theoretical assumptions
underpinning the technique.

Campbell and Fiske (1959) introduced the technique with examples drawn
primarily from the literature in personality and industrial psychology. In these
examples, authors attempted to assess various traits (e.g., assertiveness,
cheerfulness, poise, popularity, intelligence, etc.) using two or more methods (e.g.,
self rating vs. peer rating, paper-and-pencil test vs. direct observation, etc.).
Thus, the authors of these studies devised different methods for measuring the
same variables.

Our use of the technique was somewhat different. Rather than use different
methods to measure the same variables, we took existing methods which measured a
variety of variables and tried to specify the variables which were measured in
common across methods. Thus, our methods and variables were not tailor-made to
our research situation. Instead they were fitted to our research needs, and it
was this fitting process which created some practical problems.



We found the five systems to be quite different in the way they categorized
classroom behavior. One was based upon reinforcement contingencies in the
classroom, another on teacher-student interaction, a third sought to categorize
teacher behavior so that teacher styles could be described. Each system reflected
a particular author's view of the classroom, representing those variables thought
to be most important. Thus, different systems sliced the pie of classroom
behavior differently, although overlap was apparent. Our approach was to treat
the overlap across systems as the basis for constructing comparisons which could
be used to determine the convergent and discriminant validity of behavioral
categories within systems. However, our success at this was dependent upon our
ability to create fair and accurate matches across systems which defined behavior
categories differently.

Often the problem of matching categories reduced to shades of meaning. For
example, CC1 was composed of "giving directions" in Flanders and "gives directions"
in CCO. This would seem to be a straightforward match. However, for CC2 we
matched "silence or confusion" with "no communication." While perhaps not exactly
equivalent, these categories also seemed to have behavior in common. However,
matching sometimes became an ambiguous and inexact task. For example, is
"telling simple facts" plus "telling complex facts" equal to an "informing
statement" (CC21), or do the telling categories include more, or less, than the
informing category? Clearly, matching in this case is not the same as in the studies
cited by Campbell and Fiske where different methods were designed to measure the exact
same variables. The applicability of the MTMM technique to a particular validity
problem must ultimately depend on the redundancy of categories and operational
definitions across instruments and the conciseness with which matches can be made.



In addition to semantic differences among category definitions, another problem
complicated the matching process. This problem involved the differences which
sometimes exist between the way a category is defined in a manual and the way it is
used by coders. If a system is to be used reliably by raters, categories
[illegible] for the coders. This, of course, is the purpose of
training. However, it is not possible to include in a definition in a manual all
of the information necessary to code a particular category reliably. Often coders
find it necessary to create "ground rules" to delimit the boundaries of particular
categories. Coders, for example, might have difficulty distinguishing between
the categories "teacher accepts" and "teacher approves." To distinguish between
these behaviors, they may create certain ground rules for coding. For example,
coders might decide that if the teacher uses an exclamation such as "Oh!" or "My!"
in regard to a student's comment, the proper code is "teacher approves;"
otherwise, the code is "teacher accepts." From our experience in this study, ground
rules like this are not uncommon, and while they do not seem to distort the
meaning of the categories, they delimit their meaning in a way which might not be
apparent to a reader of the manual. Furthermore, it is not uncommon for coders
or system authors to modify ground rules to fit different classroom situations.

Since the actual operationalizations of categories can change from coder to
coder or from study to study (depending upon the classroom situation being coded),
the manual definitions, besides being somewhat ambiguous, are at times only
guidelines to the meaning of the categories. Thus, the exactitude of the matching
process may vary across contexts and coders.

Certain theoretical considerations are also of interest. One of these concerns
the independence of methods of measurement. The multitrait-multimethod (MTMM)
technique is based upon the use of independent methods of measuring the same
variables. Although Campbell and Fiske note that independence is a matter of
degree, Calkins, Malitz, Natalicio, and Mote (Note 2) point out that the
"determination of validity is enhanced by the inclusion of methods of measurement
which [are as independent] as possible" (p. 2). The reason for this is that the
"determination of convergent and discriminant validity for a set of variables can
be obfuscated [if the] traits have been quantified by the same method of
measurement. If all the traits [are measured] by the same method, high
correlations could result because all the variables share method variance"
(pp. 1-2). The extreme case of non-independence is where exactly the same method
is used to measure the variables. In this case, the values in the validity
diagonal are merely reliability values. Since high reliability can be obtained in
the absence of validity, this extreme [case does] not address the issue of
validity. Thus, to the extent that the methods are not [independent, the MTMM]
technique will not yield useful validity data.
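The comparisons on which the multitrait-multimethod technique rests can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the correlations are hypothetical values, not results from this study; the names `corr` and `check_trait` are invented for the example; and comparing absolute values is a choice made here, not a rule taken from Campbell and Fiske. Two traits (A, B) are each measured by two methods (I, II).

```python
# Hypothetical correlations among four measures: trait A and trait B,
# each measured by method I and method II. Not data from the study.
r = {
    ("A_I", "B_I"): 0.30,    # heterotrait-monomethod (method I)
    ("A_II", "B_II"): 0.25,  # heterotrait-monomethod (method II)
    ("A_I", "A_II"): 0.70,   # validity diagonal, trait A
    ("B_I", "B_II"): 0.65,   # validity diagonal, trait B
    ("A_I", "B_II"): 0.10,   # heterotrait-heteromethod
    ("B_I", "A_II"): 0.15,   # heterotrait-heteromethod
}


def corr(x, y):
    """Look up a correlation regardless of argument order."""
    return r[(x, y)] if (x, y) in r else r[(y, x)]


def check_trait(trait, other, methods=("I", "II")):
    """Compare a trait's validity value with the values it should exceed."""
    m1, m2 = methods
    validity = corr(f"{trait}_{m1}", f"{trait}_{m2}")
    hetero_method = [corr(f"{trait}_{m1}", f"{other}_{m2}"),
                     corr(f"{other}_{m1}", f"{trait}_{m2}")]
    mono_method = [corr(f"{trait}_{m}", f"{other}_{m}") for m in methods]
    return {
        "validity": validity,
        # Assumption: magnitudes are compared, so a large negative value counts.
        "exceeds_heteromethod": all(validity > abs(v) for v in hetero_method),
        "exceeds_monomethod": all(validity > abs(v) for v in mono_method),
    }


result_a = check_trait("A", "B")
result_b = check_trait("B", "A")
print(result_a)
print(result_b)
```

If both methods shared a large amount of method variance, the monomethod values would rise toward the validity values and the second comparison would fail, which is the situation the paragraph above warns about.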

In the case of our study, [the methods were not] as independent as one might
wish, since all [were methods] of behavioral observation. While they represented
different theoretical rationales and time intervals for collecting data, all
[can be characterized] as low to medium inference counting systems producing
frequency data. Given that the independence of different classroom observation
systems may be difficult to assess, one might include in studies using the MTMM
procedure both low inference counting systems and high inference rating scales in
order to assure the maximum amount of independence among measurement instruments.

A second theoretical consideration [in the] use of the MTMM technique involves
its statistical assumptions. Kalleberg and Kluegel (1975) point out that Campbell
and Fiske's technique assumes that traits and methods are uncorrelated and that
methods are minimally correlated with one another. Kalleberg and Kluegel assert
that it is unreasonable to assume that method factors are uncorrelated with trait
factors. Moreover, as pointed out above, independence of methods is sometimes
difficult to achieve and the degree of independence which exists is difficult to






assess. Thus, Kalleberg and Kluegel view the model implicit in the MTMM technique
as [...]. As an alternative, they recommend confirmatory factor analysis,
[CFA (Jöreskog,] 1969, 1970), a technique based on the Werts and Linn (1970) path
analysis model. Several alternative methods of analyzing MTMM matrices, including
CFA, have been developed, some based upon analysis of variance techniques and some
based upon factor analysis (see Alwin, 1970, for a review of these techniques).
Of these techniques, it appears that CFA is based upon the least restrictive model
(in terms of statistical assumptions about the data). However, CFA cannot be used
to its fullest extent unless the matrix contains three or more methods and three
or more variables. None of the matrices produced in this study met this
requirement. In addition, CFA is based upon rather rigorous statistical and
mathematical derivations that are not easily fathomed by the average researcher.
This makes the workings of CFA and its output rather difficult to understand and
to communicate to other researchers. Until CFA is better understood and more
widely accepted, it does not seem to be a practical alternative to Campbell and
Fiske's technique, which, despite its assumptions, seems particularly suited to
practical psychometric application in process-product research.

This study tested the applicability of the MTMM validation procedure to classroom
observation instruments. The study brought to light several nuances and
assumptions of the technique which define the context in which MTMM is most
appropriate. It was found that the applicability of the MTMM procedure can be
expected to vary across validity studies, depending upon two primary
considerations: (1) the conciseness with which behavioral categories can be
matched across classroom observation systems, and (2) the degree to which the
investigator can include comparison instruments in the validity study of
sufficient variety (e.g., low vs. high inference, or a counting vs. a rating
metric) to assure a reasonable degree of independence among methods. To the extent
that these considerations are addressed, the validation procedures employed in
this study were found to constitute a potentially economical and practical model
for examining the validity of other classroom observation systems.




Reference Notes

1. CERLI Verbal-Behavior Classification System (CVC), by Cooperative
   Educational Research Laboratory, Inc. By permission of Everette Breningmeyer,
   Cooperative Educational Research Laboratory, Inc. From a Report to the
   Office of Education, U.S. Department of Health, Education and Welfare,
   Contract OEC 3-7-061391-3061, April 1969.

2. Calkins, D., Malitz, D., Natalicio, L., & Mote, T. Validation of some
   pencil-and-paper anxiety measures by the multitrait-multimethod procedure.
   Paper presented at the XVI Interamerican Congress of Psychology meeting,
   Miami, Florida, 1976.






References

Alwin, D. F. Approaches to the interpretation of relationships in the
    multitrait-multimethod matrix. In E. F. Borgatta & G. W. Bohrnstedt (Eds.),
    Sociological methodology 1970. San Francisco, California: Jossey-Bass, 1970.
Borich, G. D. Sources of invalidity in measuring classroom behavior.
    Instructional Science, 1977, 6(3).
Borich, G. D., & Malitz, D. Convergent and discriminant validation of three
    classroom observation systems: A proposed model. Journal of Educational
    Psychology, 1975, 67(3), 426-431.
Brophy, J. E., & Evertson, C. M. Learning from teaching. Boston: Allyn &
    Bacon, 1976.
Campbell, D. T., & Fiske, D. W. Convergent and discriminant validation by the
    multitrait-multimethod matrix. Psychological Bulletin, 1959, 56, 81-105.
Chall, J. S., & Feldman, S. C. A study in depth of first grade reading
    (U.S. Office of Education Cooperative Research Project No. 2728). New York:
    The City College of the City University of New York, 1966.
Flanders, N. A. Analyzing teaching behavior. Reading, Massachusetts:
    Addison-Wesley, 1971.
Godbout, R. C., Marston, P. T., Borich, G. D., & Vaughan, C. The problem of
    spurious significance in classroom education research (Res. Rep. 10). Austin,
    Texas: The Research and Development Center for Teacher Education, The
    University of Texas at Austin, 1977.
Good, T., & Grouws, D. Teacher rapport: Some stability data. Journal of
    Educational Psychology, 1975, 67, 179-182.
Jöreskog, K. G. A general approach to confirmatory maximum likelihood factor
    analysis. Psychometrika, 1969, 34, 183-202.
Kalleberg, A. L., & Kluegel, J. R. Analysis of the multitrait-multimethod matrix:
    Some limitations and an alternative. Journal of Applied Psychology, 1975,
    60, 1-9.
McDonald, F. J., Elias, P., Stone, M., Wheeler, P., Lambert, N., Calfee, R.,
    Sandoval, J., Ekstrom, R., & Lockheed, M. Final report on phase II
    beginning teacher evaluation study. Prepared for the California Commission
    on Teacher Preparation and Licensing, Sacramento, California. Princeton:
    Educational Testing Service, 1975.
Medley, D. M., & Mitzel, H. E. A technique for measuring classroom behavior.
    Journal of Educational Psychology, 1959, 49, 227-239.
Rosenshine, B. Teaching behaviours and student achievement. London:
    International Association for the Evaluation of Educational Achievement,
    1971.
Simon, A., & Boyer, E. G. (Eds.) Mirrors for behavior. Philadelphia: Research
    for Better Schools, Inc., 1970.
Solomon, D., Bezdek, W. E., & Rosenberg, L. Teaching styles and learning.
    Chicago: The Center for the Study of Liberal Education of Adults, 1963.
Spaulding, R. L. The Spaulding teacher activity rating schedule (STARS).
    Durham, North Carolina: Education Improvement Program, Duke University,
    1967.
Stallings, J., & Kaskowitz, D. Follow through classroom observation evaluation
    1972-1973. Menlo Park, California: Stanford Research Institute, 1974.
Wallen, R. L., Sales, S. M., & Bode, S. Student authoritarianism and teacher
    authoritarianism as factors in the determination of student performance
    and attitudes. Journal of Experimental Education, 1970, 38(4), 83-87.
Werts, C. E., & Linn, R. L. Path analysis: Psychological examples.
    Psychological Bulletin, 1970, 74, 193-212.
Withall, J., & Lewis, W. W. Development of classroom observational
    categories within a communication model. Madison, Wisconsin: School of
    Education, University of Wisconsin, 1961.



Footnotes

1. See Borich (1977, Chapter 6) for the results of five large-scale studies which
   have investigated the relationship between teacher behavior and student
   outcome, and especially pp. 76-78 for a table of consistent and inconsistent
   findings across these studies.

The tendency to (1) report significant findings which fail to exceed the
number expected by chance and (2) ignore differences in the operational
definitions of purportedly similar constructs serve as examples of the
problems which have either reduced the credibility of "significant"
process-product findings or led to the proliferation of "null" or inconsistent
findings.

Rosenshine's review (1971) illustrates these problems. Rosenshine
examined the findings of approximately 50 different studies in which over
200 separate teacher behaviors were investigated. On the basis of evidence
from these studies, 11 behaviors were selected as potentially promising in
relation to pupil performance. In interpreting the efficacy of these 11
behaviors, however, we must remember that they were derived, for the most
part, from correlational, not experimental, studies. Therefore, causation
cannot be inferred. Furthermore, these behaviors were derived from clusters
of heterogeneous research studies which actually showed mixed results; some
studies within a given cluster failed to confirm the efficacy of the variable
in question. Also, variables were often operationally defined differently by
[different investigators]. Finally, in some studies the number of significant
findings failed to exceed that which could be expected by chance.

The problem of operational definitions is illustrated by the teacher
variable clarity, which, Rosenshine points out, has been defined in three
very different ways:

(1) whether "[the points] the teacher made were clear and easy to understand"
    (Solomon, Bezdek, and Rosenberg, 1963);
(2) whether "the teacher was able to explain concepts clearly ... had
    the facility with her material and enough background to answer her
    children's questions intelligently" (Wallen, 1966);
(3) whether the cognitive level of the teacher's lesson appeared to be
    "just right most of the time" (Chall and Feldman, 1966).
The problem of chance significance is illustrated by a finding which, I
suspect, is not uncommon. Godbout, Marston, Borich, & Vaughan (1977) had
occasion to analyze the extent to which process-product relationships in a
large-scale teacher effectiveness study replicated over two consecutive years,
during which time instrumentation and teacher sample remained constant. Of the
3,050 relationships studied, only 24 were significant at p < .10 in the same
direction for both years. A much more favorable result would have been expected
on the basis of chance alone. Unfortunately, since few replications of this type
are conducted, process-product researchers have no way of knowing how unstable
their findings may actually be.
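The footnote does not say how the chance expectation was computed. As a rough sketch only, the Python below works out the expected number of two-year replications under two simple null models; both models, and the variable names, are assumptions introduced here rather than the procedure of Godbout et al.

```python
n_relationships = 3050   # relationships examined (from the text)
alpha = 0.10             # significance level used in each year (from the text)
observed = 24            # significant in the same direction in both years

# Assumption 1: the two years are independent and direction is ignored.
expected_any_direction = n_relationships * alpha * alpha

# Assumption 2: two-tailed tests, and a chance result is equally likely
# to fall in either direction, so only half of the double hits agree.
expected_same_direction = expected_any_direction / 2

print(round(expected_any_direction, 2))
print(round(expected_same_direction, 2))
print(observed)
```

The two figures differ by a factor of two, so how the observed count of 24 compares with chance depends on which null model is adopted.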






Table 1. Coder reliability before and during study.*

System       Highest Prestudy     Median for
             Reliability          36 Tapes

STARS              72                 72
OScAR              88                 91
FLANDERS           91                 93
CVC                79                 82
CCO                86                 86

*Scott's coefficient.










Table 2. Flanders vs. CCO

                     Flanders              CCO
                   CC1      CC2       CC1      CC2

Flanders   CC1
           CC2    .6420
CCO        CC1    .7699    .2779
           CC2    .5620    .6620     .3454






Table 3. Flanders vs. OSCAR

                         Flanders                       OSCAR
                   CC3      CC4      CC5       CC3      CC4      CC5

Flanders   CC3
           CC4    .2247
           CC5    .0684    .1369
OSCAR      CC3    .8808   -.1399    .0399
           CC4    .1268    .8571    .2904    -.1331
           CC5    [row illegible in source]





Table 4. CVC vs. STARS 



CVC 



CC6 
CC7 
CC8 



CC6 

,7470 
1478 



CVC 
CC7 



,0072 



CCS 



CC6 



STARS 



CC7 



CCS 



STARS 



CC6 
CC7 
CCS 



.0420 
,0560 
1033 



.0279 -.2484 
1766 =.1525 



-.1476 



.1618 



.0121 

.1385 0.1730 







Table 5. CCO vs. OSCAR (CCO #18 Mean 0.00)

                            CCO                                 OSCAR
                 15       16       17       19        15       16       17       19

CCO      15
         16    -.1576
         17    -.0101    .1420
         19     .2797   -.0773    .1237
OSCAR    15     .4480    .0130    .0973    .0410
         16    -.3171    .1445    .0030   -.1983    -.0112
         17    -.2213    .4642   -.0643    .0733     .2283    .1824
         19     .5054   -.2953   -.1057    .4370     .2738   -.2061    .1677






Table 6. STARS vs. OSCAR

                    STARS               OSCAR
                 20       21         20       21

STARS    20
         21    -.4358
OSCAR    20     .6165   -.3520
         21    -.3314    .8538     -.2478







Table 7. Flanders vs. CCO vs. OSCAR

                     Flanders              CCO                OSCAR
                   22       23        22       23        22       23

Flanders   22
           23     .2247
CCO        22     .5743   -.3616
           23    -.2171    .7782    -.2757
OSCAR      22     .8808   -.1399     .5441   -.1840
           23    -.1268    .8571    -.2923    .6811    -.1331









Table 8. CVC vs. OSCAR



CVC 





9 


10 


11 


12 












10 


,0832 












.5459 


.1248 








12 


.0858 


.5504 


.2034 






.1452 


.0600 


.4429 


-.1568 


: 14 


-.2915 


-.0350 


-.3095 


.0805 


; 9 


■ ;■ - 

.6746 


-,10J8 


.1829 


-.0350 


,10 


.3622 


.1758 


-.0933 


.3550 




U 


.0236 


-.1299 


,6402 


-.4387 


- i 

0 


12 


-.0531 


,0866 


-.232? 


.3088 




.1696 


,0200 


.3??9 


-.2166 




. '2008 . 


..2917 


-.0933 


.3293 



.0283 



OSCAR 

^ 10 U 12 13 14 



.1043 

.0494 -,2702 : 

%1845 43W ^ -.3163 J 

,0684 ,0148. 2488 -.2880 " 

'.1467 ,4183 -.5148 .2790; -,0209 






Table 9. Summary table: CVC vs. OSCAR



comparison 
category 


validity 
diagonal 
value 


highest 
value in 
heteromethod 


no, higher 
than validity 
value 


highest 
value In 
nonoaethod 


no higher 
than validity 
value ; " 


CC9 

CCIO 

ceil 

CC12 
— ~ CC13 " 
CC14 


.6746* 
.1758 
.6402* 
.3088 

" :6440* 
.1928 


.3622 
.3622 
-.4387 
-.4387 

.3293 


0 

0 
0 

3 
0 
5 


.5459 
-5504 

.5459 
.5504 
.4429 ~ 
-.5148 


■ 4 ■ V ; . ^ 

: 0 
6 



* P < .05 






II 



System I 

accepts questions 
A B 
A (.16)* 



B .23 



(.70) 




System II 

values delves 
A B 



(.58) 



-.14 



(.84) 



*Interjudge reliabilities.

Figure 1. Simplified illustration of the validation model.

The validity diagonal [...] block = .43, -.01, -.10, -.12. The monomethod
triangles = .23 and -.14, respectively.






Appendix A

Behavior categories making up each of the comparison categories.

Table   System and        General Category Description     Category Numbers in
        Comparison                                         Respective System
        Category

        FLANDERS/CCO
          CC1             Giving directions                6/7
          CC2             Silence or confusion             10/13

2       FLANDERS/OSCAR
          CC3             Accepts feelings, praises,       1, 2/2, 12, 22, 32,
                          encourages                       42, 52, 62, 72, 82, 92
          CC4             Criticizing or justifying        7/6, 16, 26, 36, 46,
                          authority                        56, 66, 76, 86, 96
          CC5             Student talk initiation          9/10, 90 [remainder
                                                           illegible]

3       CVC/STARS
          CC6             Asks for feelings                3/10b
          CC7             Gives feelings                   7/10a
          CC8             Disagrees or disapproves         13, 14, 15, 16/1b, 1c, 1d

7       CVC/OSCAR
          CC9             Informs: facts                   5, 6/3, 23
          CC10            Informs: rules                   8/4, 5, 7
          CC11            Accepts: facts and               9, 10/22, 32, 33, 42,
                          interpretations                  43, 52, 53, 62, 63, 72,
                                                           73, 82, 83, 92, 93
          CC12            Accepts: feelings and            11, 12/2, 12, 13, 19
                          plans
          CC13            Rejects: facts and               13, 14/26, 35, 36, 45,
                          interpretations                  46, 55, 56, 65, 66, 75,
                                                           76, 85, 86, 95, 96
          CC14            Rejects: feelings and            15, 16/6, 15, 16, 17
                          rules

4       CCO/OSCAR
          CC15            Asks questions                   1, 2, 3/[?], 50, 60,
                                                           70, 80, 90
          CC16            Gives suggestion                 6/9
          CC17            Gives direction                  7/4, 5, 7, 17, 19
          CC19            Perfunctory agreement/           14/14, 34, 44, 54,
                          disagreement                     64, 74, 84, 94

5       STARS/OSCAR
          CC20            Restructuring                    2b/7
          CC21            Telling, informing               7a, 7b/3

6       FLANDERS/CCO/OSCAR
          CC22            Accepts feelings, praises,       1, 2/10/2, 12, 22, 32,
                          encourages                       42, 52, 62, 72, 82, 92
          CC23            Criticizing, justifying          [...]6, 16, 26, 36,
                          authority                        46, 56, 66, 76, 86, 96

CCO #18 Mean = 0.00.






Generalizability of Teacher Behaviors
Across Classroom Observation Systems

Dick Calkins, Gary D. Borich,
Maria Pascone, and C. L. Kugle
The University of Texas at Austin
Some reviewers of teacher effectiveness research (Borich, 1977a, b; Shavelson
& Atwood, 1977; and Shavelson & Dempsey, 1976) have suggested that a possible reason
for past failures to find empirically consistent relationships between teacher
behavior and pupil achievement is that the measurement process for quantifying both
the product and process variables may be unreliable. Since reliability can be
viewed as the extent to which a consistent rank ordering can be established among
subjects on some variable by a particular measurement procedure, the discovery of
potential process-product relationships could be complicated by the inability to
accurately rank order teachers according to behavior and students according to
achievement by currently existing measurement techniques. In order to investigate
the reliability of methods for quantifying teacher behavior, this study obtained
data on the generalizability of the behavioral constructs measured by five
classroom observation systems. The reliabilities of the items on these classroom
observation systems were examined via generalizability theory.

Process-product researchers commonly quantify teacher behavior on a few
occasions and then generalize the average score obtained to all other occasions. To
the extent that such a score is representative of the behavior for other occasions,
the measurement procedure can be said to be reliable. Of the several factors which
can potentially cause a behavioral measure to be unreliable, variation due to the
conditions under which the observations are made may be the most overlooked.

Generalizability theory (Cronbach, Gleser, Nanda, & Rajaratnam, 1972), which
is a combination and extension of classical test theory and Lindquist's methods
(Lindquist, 1953) of multifacet analysis of error, makes possible investigation of
[the effects] of various conditions on the values obtained from and reliability of
a behavioral measure. According to generalizability theory any measurement can be
thought of as a sample from some large set or universe of measurements of a
particular characteristic. Such a universe consists of the set of possible combinations
of conditions for which observations could be made. Conditions which vary along the
same dimension and under which observations are made are called facets. If the
variability introduced by sampling various conditions of a facet is small in
magnitude, then a measurement made for any particular condition is representative of the
measurement obtainable for any other condition, and hence, the measurement obtained
for one condition is generalizable to the entire facet.

In the present investigation of classroom observation systems the facets
sampled from the universe of facets were raters and occasions. The particular
raters and occasions utilized constitute the conditions or facets of the study.
The scores of interest, quantifications of teacher behaviors on various dimensions,
[were obtained] by averaging scores for the same
teacher. To the extent that the scores for a teacher behavior are comparable
across raters and the scores for occasions comparable across occasions, the
teaching behavior is considered generalizable. To the extent that scores for a
behavioral dimension are generalizable over raters and occasions, rank orderings
among teachers on that dimension will be consistent and the likelihood of
discovering relationships between pupil achievement and that dimension enhanced.

Decisions about the generalizability of a behavioral dimension as quantified
by a particular measurement procedure can be made with the use of a generalizability
coefficient, which is the counterpart of the reliability coefficient in classical
test theory. The coefficient of generalizability, which is an intraclass
correlation (Hays, 1973), is defined as the ratio of the universe score variance to the
observed score variance. Like all intraclass correlations it takes values between
zero and one, and for this coefficient a value of zero indicates total lack of
generalizability.
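The ratio just defined can be written out for the teacher by occasion by rater design used in this study. The sketch below follows the usual form of the coefficient for a teacher's mean score over several occasions and raters; the function name, the argument names, and the numeric variance components are all hypothetical and are not estimates from the study.

```python
def g_coefficient(var_t, var_to, var_tr, var_res, n_occasions, n_raters):
    """Ratio of universe-score variance to expected observed-score variance
    for a teacher's mean over n_occasions and n_raters."""
    # Error variance: interaction components shrink as more conditions
    # of each facet are averaged over.
    error = (var_to / n_occasions
             + var_tr / n_raters
             + var_res / (n_occasions * n_raters))
    observed = var_t + error
    return var_t / observed if observed > 0 else 0.0


# Hypothetical variance components (not estimates from this study).
example = g_coefficient(var_t=4.0, var_to=2.0, var_tr=1.0, var_res=6.0,
                        n_occasions=3, n_raters=2)
print(round(example, 3))
```

When the teacher component is zero the coefficient is zero regardless of the number of raters and occasions, which corresponds to the "total lack of generalizability" described above.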






The mechanics of generalizability theory concern the analysis of variance
components associated with facets. These variance components are then used to
calculate generalizability coefficients and to suggest changes in a research
design, such as increasing the number of raters and occasions needed to obtain a
particular level of generalizability. This latter use of the generalizability
coefficient is similar to the use made of the Spearman-Brown formula (Nunnally,
1967) in classical test theory. In addition, examination of the variance
components of facets can reveal the necessity of making changes in the data
collection procedures, such as additional training of raters, or changes in the
overall design, such as including more facets (e.g., including a variance
component which takes into account differences between subject matter being
taught), in order to increase generalizability. The present study undertook to
determine the extent to which the behaviors on five classroom observation systems
were generalizable across two facets, raters and occasions.
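For readers less familiar with the Spearman-Brown formula mentioned above, a minimal Python sketch follows. The function names and the example reliability of .40 are illustrative assumptions; the formula itself is the standard one for predicting reliability when a measure is lengthened k-fold.

```python
def spearman_brown(reliability, k):
    """Predicted reliability when a measure is lengthened k-fold."""
    return k * reliability / (1 + (k - 1) * reliability)


def lengthening_needed(reliability, target):
    """Factor k by which the measure must be lengthened to reach target."""
    return target * (1 - reliability) / (reliability * (1 - target))


# Hypothetical single-occasion reliability of .40.
tripled = spearman_brown(0.40, 3)
factor = lengthening_needed(0.40, 0.70)
print(round(tripled, 3))
print(round(factor, 2))
```

The generalizability coefficient is used in the same way, except that the number of conditions can be varied separately for each facet (here, raters and occasions).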

Method

The data were obtained from videotapes of 12 in-service junior high school
teachers, each teaching the same content, a unit in social studies, on three
occasions of approximately 50 minutes duration. Each of 36 videotapes was rated by
five pairs of coders, each pair trained in a different observational coding system.
For two of the five observational systems, coders were employed who had previously
been trained by the authors of these systems, these being the two most complex
systems. The remaining three pairs of coders were trained by the investigators from
training materials supplied by system authors and from standard protocols from
system manuals.

The systems employed for this study were selected from Simon and Boyer's
Mirrors for Behavior (1970). The systems were (1) the Observation Schedule and
Record, OScAR 5 (Medley & Mitzel, 1959), (2) Spaulding Teacher Activity Rating
Schedule, STARS (Spaulding, 1967), (3) Flanders System of Interaction Analysis
(Flanders, 1971), (4) CERLI Verbal-Behavior Classification System, CVC (Cooperative
Educational Research Laboratory) (Note 1), and (5) the Classroom Communication
Observational System, CCO (Withall, Lewis & Newell, 1961). These systems were
selected because of their availability to the educational research community (and
therefore presumed use) and for the number of categories and associated operational
definitions they had in common.

[Upon] completion of training, system coders, using their respective systems,
rated three trial videotapes of the same general form and content as the experimental
tapes in order to obtain estimates of interjudge reliability prior to the study.
While reliabilities varied due to system complexity, all were deemed acceptable
and are reported in Table 1 along with the median reliability for each pair of
coders over all 36 tapes.



Insert Table 1 about here



The data obtained from the five observation systems were analyzed separately
utilizing a computer program (Erlich, 1976) which was designed to compute variance
components and generalizability coefficients for a fully crossed teacher by
occasion by rater design. All facets were assumed to be random.
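The kind of computation such a program performs can be sketched as follows. This is not Erlich's program; it is a minimal illustration, assuming one score per teacher, occasion, and rater, that solves the standard expected-mean-square equations for a fully crossed random design. The function name and the toy data are hypothetical.

```python
from itertools import product


def variance_components(x):
    """Estimate variance components from x[t][o][r]: one score for each
    teacher t, occasion o, and rater r (fully crossed, all facets random)."""
    nt, no, nr = len(x), len(x[0]), len(x[0][0])
    T, O, R = range(nt), range(no), range(nr)

    def mean(values):
        values = list(values)
        return sum(values) / len(values)

    # Grand mean and marginal means.
    grand = mean(x[t][o][r] for t, o, r in product(T, O, R))
    m_t = [mean(x[t][o][r] for o, r in product(O, R)) for t in T]
    m_o = [mean(x[t][o][r] for t, r in product(T, R)) for o in O]
    m_r = [mean(x[t][o][r] for t, o in product(T, O)) for r in R]
    m_to = [[mean(x[t][o][r] for r in R) for o in O] for t in T]
    m_tr = [[mean(x[t][o][r] for o in O) for r in R] for t in T]
    m_or = [[mean(x[t][o][r] for t in T) for r in R] for o in O]

    # Mean squares for the teacher effect and the terms involving teachers.
    ms_t = no * nr * sum((m_t[t] - grand) ** 2 for t in T) / (nt - 1)
    ms_to = nr * sum((m_to[t][o] - m_t[t] - m_o[o] + grand) ** 2
                     for t, o in product(T, O)) / ((nt - 1) * (no - 1))
    ms_tr = no * sum((m_tr[t][r] - m_t[t] - m_r[r] + grand) ** 2
                     for t, r in product(T, R)) / ((nt - 1) * (nr - 1))
    ms_res = sum((x[t][o][r] - m_to[t][o] - m_tr[t][r] - m_or[o][r]
                  + m_t[t] + m_o[o] + m_r[r] - grand) ** 2
                 for t, o, r in product(T, O, R)
                 ) / ((nt - 1) * (no - 1) * (nr - 1))

    # Solve the expected-mean-square equations for the components.
    return {
        "teacher": (ms_t - ms_to - ms_tr + ms_res) / (no * nr),
        "teacher_x_occasion": (ms_to - ms_res) / nr,
        "teacher_x_rater": (ms_tr - ms_res) / no,
        "residual": ms_res,
    }


# Toy data: 3 teachers, 3 occasions, 2 raters; scores depend only on teacher.
toy = [[[float(level)] * 2 for _ in range(3)] for level in (1, 2, 3)]
components = variance_components(toy)
print(components)
```

With real data the estimated components can come out negative through sampling error; how such values are handled (for example, setting them to zero) is a further assumption that the sketch leaves to the user.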



Results 



The results of this study are contained in Tables 2 through 7. Tables 2
through 6 contain the category descriptions and generalizability coefficients for
each item on the five observation systems studied. [The results for] each
observation system are presented in a separate table. Table 7 summarizes the
generalizability results for comparable categories across systems. [In Tables] 2
through 7 items are considered generalizable if the generalizability coefficient
exceeds .7 for a combination of eight or fewer raters and eight or fewer occasions.
Teacher behaviors which require more than this many raters and occasions to obtain
a reliable estimate are usually inconsistent and fluctuating, suggesting a need to
redefine and/or reconceptualize these variables.
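The criterion above amounts to a search over designs. The following Python sketch, which assumes the same form of the coefficient as in a teacher by occasion by rater design and uses hypothetical variance components, looks for the smallest combination of raters and occasions (each at most eight) whose coefficient exceeds .7. Minimizing the total number of observations is a choice made for the example, not something specified in the study.

```python
def g_coefficient(var_t, var_to, var_tr, var_res, n_occasions, n_raters):
    """Universe-score variance over expected observed-score variance."""
    error = (var_to / n_occasions + var_tr / n_raters
             + var_res / (n_occasions * n_raters))
    total = var_t + error
    return var_t / total if total > 0 else 0.0


def smallest_design(var_t, var_to, var_tr, var_res,
                    criterion=0.7, max_raters=8, max_occasions=8):
    """Return (n_raters, n_occasions, coefficient) for the design with the
    fewest total observations that exceeds the criterion, or None."""
    best = None
    for n_r in range(1, max_raters + 1):
        for n_o in range(1, max_occasions + 1):
            g = g_coefficient(var_t, var_to, var_tr, var_res, n_o, n_r)
            if g > criterion and (best is None
                                  or n_r * n_o < best[0] * best[1]):
                best = (n_r, n_o, g)
    return best


# Hypothetical components: a fairly stable behavior and an unstable one.
stable = smallest_design(var_t=4.0, var_to=2.0, var_tr=1.0, var_res=6.0)
unstable = smallest_design(var_t=0.2, var_to=5.0, var_tr=1.0, var_res=6.0)
print(stable)
print(unstable)
```

A result of `None` corresponds to an item that would be classed as not generalizable under the criterion used in Tables 2 through 7.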

Consideration of Table 2 indicates that six or 43% of the 14 CCO items are
generalizable. Consideration of Table 3 indicates that six or 37% of the 16 CVC
items are generalizable. Consideration of Table 4 indicates that two or 20% of the
ten Flanders items are generalizable. Consideration of Table 5 indicates that 13
or 18% of the 74 OScAR items are generalizable. And consideration of Table 6
indicates that six or 24% of the 25 STARS items are generalizable. Thus, of the
139 total items in the five classroom observation systems, 33 or 24% were
generalizable.
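The item counts reported above can be tallied directly. The short Python check below uses only the counts given in the text; the dictionary layout and variable names are illustrative.

```python
# (generalizable items, total items) per system, as reported in the text.
counts = {
    "CCO": (6, 14),
    "CVC": (6, 16),
    "Flanders": (2, 10),
    "OScAR": (13, 74),
    "STARS": (6, 25),
}

for system, (hits, total) in counts.items():
    print(f"{system}: {hits}/{total} = {100 * hits / total:.1f}%")

total_hits = sum(h for h, _ in counts.values())
total_items = sum(n for _, n in counts.values())
overall_percent = 100 * total_hits / total_items
print(f"All systems: {total_hits}/{total_items} = {overall_percent:.1f}%")
```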



Insert Tables 2 through 7 here 



Consideration of Table 7 reveals that of the 11 behavioral categories composed
of comparable items across the Flanders, CVC, CCO and STARS systems (no comparable
items were found for OScAR) only three categories were generalizable in more than
one of these systems. These were the praise and approval category, for which the
Flanders, CVC and CCO systems had generalizable items; the asks questions category,
for which the CVC, CCO and STARS systems had generalizable items; and the gives
information category, for which the Flanders, CVC, CCO and STARS systems had
generalizable items.

Discussion and Conclusions

The results of this study indicate that fewer than one-fourth of the behavioral
categories on the five classroom observation systems studied were generalizable with
any combination of eight or fewer raters and eight or fewer occasions. It is
discouraging to note that among process-product researchers these criteria may be
considered a particularly liberal standard of generalizability, generally exceeding
available resources. In addition, only three behavioral categories logically
comparable across systems were generalizable, these behavioral categories being
praise and approval, asks questions, and gives information. Moreover, less than
half of the behavioral constructs generalizable within systems (15 of 33) were
generalizable in any other system, including all combinations of two-system
comparisons.

These findings support the contention that either the scores obtained from the
five classroom observation systems were not exhibiting those behavioral
characteristics of junior high school teachers which are generalizable or much of
the teacher behavior as recorded by these classroom observation systems is, in
fact, ungeneralizable. In either case, if the scores resulting from other classroom
observation systems are as unstable as for the five considered in this study, the
lack of generalizability may be considered a tenable hypothesis for why so few
consistent process-product relationships have been reported.

In addition to the conclusions above, several methodological issues are
relevant to the generalizability of the behavioral constructs measured in this study.
The implementation of classroom observation systems in process-product research
often proceeds by having a sample of teachers observed by a sample of observers on
a sample of occasions and their behavior quantified utilizing some observation
instrument. Each teacher is then assigned a score for each behavioral category on
the observation instrument which is usually the average of the ratings across
occasions for that category. Such scores are considered representative of the
behavior typically exhibited by that teacher, and thus are used as statistical
estimates for purposes of data analysis.

However, a problem arises with the accuracy of this statistical estimate when
only the context in which it was calculated is considered instead of all such
contexts relevant to the teacher's classroom behavior. The basis of this problem
derives from the obvious notion that people behave differently in different
situations. The situations or contexts in which teachers behave differently may
vary according to the nature of the students, the nature of the school, the
subject matter being taught, the resource materials available, past training and
experiences, as well as other factors. Thus, situationally determined variation
in behavior can cause the determination of generalizability to be misleading when
different situations or contexts which affect behavior are typically encountered
by the teacher but not considered as facets in a generalizability study. For
example, when certain concepts are to be taught to students, one group of teachers
because of past training and experience may structure the learning situation such
that lecturing is the dominant activity during a particular period (occasion 1)
and discussion is the dominant activity during another period (occasion 2). But
for a second group of teachers presenting the same content, the first and second
periods could both be spent in lecturing and discussion, equaling the amount of
time spent in lecturing and discussing by the first group. If a researcher in
preparing a generalizability study fails to note such differences among groups of
teachers and codes, say, the behaviors gives information and asks questions with
periods treated as occasions, neither of these behavioral categories is likely
to be found generalizable. For a particular category of behavior to be
generalizable from a design which considers raters and occasions as facets, it is
necessary that the behaviors coded be consistently emitted and recorded for all
occasions. Since only for the second group of teachers will the behavior being
coded occur over both occasions, the behaviors will not appear generalizable.
Hence, the nature of the teaching situation may militate against the
generalizability of a particular behavioral category with a design employing only
raters and occasions as facets.
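The lecturing and discussion example can be made concrete with a small Python sketch. The minutes below are hypothetical figures chosen to mirror the example (two 50-minute periods per teacher), and the names are illustrative.

```python
# Minutes of lecturing ("gives information") in two 50-minute periods.
# Hypothetical figures that mirror the example in the text.
group_1 = [[50, 0], [50, 0], [50, 0]]      # lecture first, discuss second
group_2 = [[25, 25], [25, 25], [25, 25]]   # both activities in each period
teachers = group_1 + group_2


def sample_variance(values):
    """Unbiased sample variance of a list of numbers."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)


# Each teacher's mean over occasions, and the spread of those means.
teacher_means = [sum(scores) / len(scores) for scores in teachers]
between_teacher_variance = sample_variance(teacher_means)

# Spread of each teacher's scores across occasions (zero for group 2).
within_teacher_variances = [sample_variance(scores) for scores in teachers]

print(teacher_means)
print(between_teacher_variance)
print(within_teacher_variances)
```

Because every teacher lectures for the same total time, the teacher means do not differ, while the first group's scores swing widely between occasions. A design with only raters and occasions as facets would therefore attribute the variation to occasions and to the teacher by occasion interaction, and the behavior would not appear generalizable.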

An approach to situationally determined variation is to reconceptualize the
classroom context by applying generalizability theory to all facets thought to be
relevant to the behavioral constructs under study. If "teaching situation" were
conceived as a superordinate categorization of teacher behavior, such as giving
information, asking questions, providing reinforcement, etc., then each teaching
behavior within such categorizations could be characterized by a score for each
situation. Such situation specific scores might be more likely found generalizable
over raters and occasions and related to student achievement than overly simplistic
context-free behavioral constructs as presently conceptualized by some classroom
observation systems.



Reference Note



CERLI Verbal-Behavior Classification System (CVC), by Cooperative Educational
Research Laboratory, Inc. By permission of Everette Breningmeyer, Cooperative
Educational Research Laboratory, Inc., Northfield, Illinois. From a Report
to the Office of Education, U. S. Department of Health, Education and
Welfare, Contract OEC 3-7-061391-3061, April 1969.



References

Borich, G. D. The appraisal of teaching: Concepts and process. Reading,
    Massachusetts: Addison-Wesley, 1977a.
Borich, G. D. Sources of invalidity in measuring classroom behavior.
    Instructional Science, 1977b, 6 (3).
Cronbach, L. J., Gleser, G. C., Nanda, H., & Rajaratnam, N. The dependability
    of behavioral measures: The theory of generalizability for scores and profiles.
    New York: John Wiley & Sons, Inc., 1972.
Erlich, O. A study of the generalizability of measures of teacher behavior.
    Unpublished dissertation, University of California, Los Angeles, 1976.
Flanders, N. A. Analyzing teaching behavior. Reading, Massachusetts:
    Addison-Wesley, 1971.
Hays, W. L. Statistics for the social sciences (2nd ed.). New York: Holt,
    Rinehart and Winston, 1973.
Lindquist, E. F. Design and analysis of experiments in psychology and education.
    Boston: Houghton-Mifflin, 1953.
Medley, D. M., & Mitzel, H. E. A technique for measuring classroom behavior.
    Journal of Educational Psychology, 1959, 49, 227-239.
Nunnally, J. C. Psychometric theory. New York: McGraw-Hill, 1967.
Shavelson, R., & Atwood, N. Generalizability of measures of teaching process.
    In Borich, G. D. The appraisal of teaching: Concepts and process.
    Reading, Massachusetts: Addison-Wesley, 1977.
Shavelson, R., & Dempsey, N. Generalizability of measures of teaching behavior.
    Review of Educational Research, 1976, 46 (4), 553-611.
Simon, A., & Boyer, E. G. (Eds.) Mirrors for behavior. Philadelphia: Research
    for Better Schools, Inc., 1970.
Spaulding, R. L. The Spaulding teacher activity rating schedule (STARS).
    Durham, North Carolina: Education Improvement Program, Duke University,
    1967.
Withall, J., Newell, J. M., & Lewis, W. W. Development of classroom observational
    categories within a communication model. Madison, Wisconsin: School of
    Education, University of Wisconsin, 1961.




Table 1. Coder reliability before and during study.*

System        Highest Prestudy Reliability    Median for 36 Tapes

STARS                    72                           72
OScAR                [illegible]                      91
FLANDERS                 91                           93
CVC                      79                           82
CCO                   816 [sic]                       86

* Scott's coefficient.
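Scott's coefficient corrects the observed proportion of agreement between two
coders for the agreement expected by chance, where chance agreement is computed
from category proportions pooled over both coders. A minimal sketch, assuming two
coders each assign one category code to every observation unit; the function name
`scotts_pi` and the example codes are hypothetical and are not taken from the
study's data:

```python
from collections import Counter

def scotts_pi(codes_a, codes_b):
    """Scott's pi for two coders who coded the same sequence of units."""
    if len(codes_a) != len(codes_b) or len(codes_a) == 0:
        raise ValueError("both coders must code the same non-empty set of units")
    n = len(codes_a)

    # Observed proportion of units on which the coders agree.
    observed = sum(a == b for a, b in zip(codes_a, codes_b)) / n

    # Expected agreement from category proportions pooled over both coders.
    pooled = Counter(codes_a) + Counter(codes_b)
    expected = sum((count / (2 * n)) ** 2 for count in pooled.values())

    if expected == 1.0:
        # Only one category was ever used, so the coefficient is undefined.
        return float("nan")
    return (observed - expected) / (1.0 - expected)

# Hypothetical category codes from two coders for four units.
coder_1 = [1, 1, 2, 2]
coder_2 = [1, 2, 2, 2]
print(scotts_pi(coder_1, coder_2))
```

Because the chance term uses pooled proportions, the coefficient treats the two
coders as interchangeable, which suits a design in which raters alternate.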







Table 2

Item descriptions and generalizability
coefficients for two raters and three
occasions for CCO.

Item   Title                                     Coefficient

  1    Asks information                           .1001
  2    Seeks or accepts direction                 .5839*
  3    Asks for opinion or analysis               .4601
  4    Listens                                    .2498
  5    Gives information                          .6403*
  6    Gives suggestion                           .6958*
  7    Gives direction                            .4281
  8    Gives opinion                              .6610*
  9    Gives analysis                             .6956*
 10    Shows positive feeling                     .7220
 11    Inhibits communication                     0 NR
 12    Shows negative feeling                     .3621
 13    No communication                           0 NV
 14    Perfunctory agreement or disagreement      .4646

Note:  Coefficients above .7 without an (*) are generalizable with two
       raters and three occasions.

    *  Generalizability of .7 was reached for some combination of 8
       or fewer raters and 8 or fewer occasions.

   NR  represents the situation when no responses were coded for
       this item.

   NV  represents the situation when a negative variance less than
       -.2 occurred for this item.






Table 3

Item descriptions and generalizability
coefficients for two raters and three
occasions for CVC.

Item   Title                                                  Generalizability
                                                              Coefficient

  1    Seeks factual or specific information                   .2027
  2    Asks for reasoning, explanation, interpretation,
       judgment, or evaluation                                 .3133
  3    Asks for feelings, asks about feelings                  .6658*
  4    Asks about rules, plans, or directions                  .4463
  5    Tells factual or specific material                      0 NV
  6    Gives reasons, interpretations, judgments or
       evaluation                                              .7188
  7    Gives or tells feelings                                 .4093
  8    Tells classroom structure, rules, directions,
       plans                                                   .2835
  9    Approves factual or specific answers                    .6101*
 10    Accepts reasoned ideas, interpretations,
       judgments                                               .3464
 11    Approves or empathizes with feelings expressed          .1020
 12    Accepts or agrees with plans, rules, directions         .4160
 13    Disagrees with answer, or factual statement             .7760
 14    Disapproves of thinking, interpretation, etc.           .5708*
 15    Responds negatively to feelings expressed by
       others                                                  0
 16    Rejects rules, plans, expectations, directions          .5437*

Note:  Coefficients above .7 without an (*) are generalizable with two
       raters and three occasions.

    *  Generalizability of .7 was reached for some combination of 8
       or fewer raters and 8 or fewer occasions.

   NR  represents the situation when no responses were coded for
       this item.

   NV  represents the situation when a negative variance less than
       -.2 occurred for this item.



Table 4

Item descriptions and generalizability
coefficients for two raters and three
occasions for Flanders.

Item                                        Generalizability
                                            Coefficient

Accepts feeling                              .0357
Praises or encourages                        .7686
Accepts or uses ideas of student             .1690
Asks questions                               0
Lecturing                                    .8593
Giving direction                             0 NV
Criticizing or justifying authority          .0974
Student talk-response                        0 [illegible]
Student talk-initiation                      .4764
Silence or confusion                         [illegible]

Note:  Coefficients above .7 without an (*) are generalizable with two
       raters and three occasions.

    *  Generalizability of .7 was reached for some combination of 8
       or fewer raters and 8 or fewer occasions.

   NR  represents the situation when no responses were coded for
       this item.

   NV  represents the situation when a negative variance less than
       -.2 occurred for this item.



Table 5

Item descriptions and generalizability
coefficients for two raters and three
occasions for OScAR.

Item   Title                                         Generalizability
                                                     Coefficient

  1    Considering statement                          .3881
  2    Informing statement                            .5865*
  3    Describing statement                           0 NV
  4    Directing statement                            .0268
  5    Rebuking statement                             0
  6    Desisting statement                            .2496
  7    Non-substantive question                       .0250*
  8    Procedural question/positive                   .7133
  9    Pupil non-substantive/utterance                0
 10    Pupil non-substantive/not evaluated            0
 11    Pupil non-substantive/supported                .6652*
 12    Pupil non-substantive/approved                 .7293
 13    Pupil non-substantive/acknowledged             0
 14    Pupil non-substantive/neutrally rejected       0
 15    Pupil non-substantive/criticized               0
 16    Pupil procedural question/negative             0
 17    Pupil procedural question/neutral              .0978
 18    Pupil procedural question/positive             0
 19    Pupil question                                 [illegible]
 20    Pupil question/not evaluated                   0
 21    Pupil question/supported                       0
 22    Pupil question/approved                        .0038
 23    Pupil question/acknowledged                    0
 24    Pupil question/neutrally rejected              0
 25    Pupil question/criticized                      0 NR
 26    Pupil statement                                0
 27    Pupil statement/not evaluated                  .0198
 28    Pupil statement/supported                      .8549
 29    Pupil statement/approved                       .0119
 30    Pupil statement/acknowledged                   0

Table 5 (cont.)

Item   Title                                               Generalizability
                                                           Coefficient

 31    Pupil statement/neutrally rejected                   .1896
 32    Pupil statement/criticized                           0 NR
 33    Pupil response                                       0
 34    Pupil response/not evaluated                         0 NV
 35    Pupil response/supported                             .6623*
 36    Pupil response/approved                              .3718
 37    Pupil response/acknowledged                          .2019
 38    Pupil response/neutrally rejected                    0
 39    Pupil response/criticized                            0 NR
 40    Problem structuring/statement                        .3071
 41    Choral response                                      .5257*
 42    Choral response/supported                            0
 43    Choral response/approved                             .2034
 44    Choral response/acknowledged                         .4573
 45    Choral response/neutrally rejected                   .6751*
 46    Choral response/criticized                           0 NR
 47    Convergent question/not answered                     .0431
 48    Convergent interchange/not evaluated                 .2081
 49    Convergent interchange/supported                     0
 50    Convergent interchange/approved                      0
 51    Convergent interchange/acknowledged                  0
 52    Convergent interchange/neutrally rejected            .0433
 53    Convergent interchange/criticized                    0 NR
 54    Elaborating question/not answered (1)                .1562
 55    Elaborating interchange/not evaluated (1)            0
 56    Elaborating interchange/supported (1)                .2479
 57    Elaborating interchange/approved (1)                 0
 58    Elaborating interchange/acknowledged (1)             .5074*
 59    Elaborating interchange/neutrally rejected (1)       0
 60    Elaborating interchange/criticized (1)               0 NR
 61    Elaborating question/not answered (2)                .5519*
 62    Elaborating interchange/not evaluated (2)            0
 63    Elaborating interchange/supported (2)                .5499*
 64    Elaborating interchange/approved (2)                 .4514






Table 5 (cont.)

Item   Title                                               Generalizability
                                                           Coefficient

 65    Elaborating interchange/acknowledged (2)             [illegible]
 66    Elaborating interchange/neutrally rejected (2)       [illegible]
 67    Elaborating interchange/criticized (2)               0 NR
 68    Divergent question/not answered                      0
 69    Divergent interchange/not evaluated                  [illegible]
 70    Divergent interchange/supported                      .5769*
 71    Divergent interchange/approved                       .0914
 72    Divergent interchange/acknowledged                   0
 73    Divergent interchange/neutrally rejected             0
 74    Divergent interchange/criticized                     0 NR

Note:  Coefficients above .7 without an (*) are generalizable with two
       raters and three occasions.

    *  Generalizability of .7 was reached for some combination of 8
       or fewer raters and 8 or fewer occasions.

   NR  represents the situation when no responses were coded for
       this item.

   NV  represents the situation when a negative variance less than
       -.2 occurred for this item.




Table 6

Item descriptions and generalizability
coefficients for two raters and three
occasions for STARS.

Item   Title                                                  Generalizability
                                                              Coefficient

  1    Non-transactional behavior                              0 NR
  2    Disapproval with aversive stimuli present               0 NR
  3    Disapproval indicated by removal of social
       reinforcers or, in some cases, physical reinforcers     0
  4    Withholding reinforcers when a student or child
       bids for attention                                      .1917
  5    Approval with positive affect present                   .4103
  6    Social and/or motor structuring                         .2204
  7    Social and/or motor restructuring                       .0755
  8    Digressions                                             .6173*
  9    Inductive methods-presenting simple facts               0 NR
 10    Inductive methods-complex concepts                      0
 11    Concept formation-simple facts (deductive)              0
 12    Concept formation-complex facts (deductive)             .5782*
 13    Concept formation-simple or complex events
       (transductive or analogical)                            0 NR
 14    Telling-simple facts                                    .7106
 15    Telling-complex facts                                   0 NR
 16    Rote process-simple                                     0
 17    Information                                             .4719
 18    Focusing attention                                      .3616
 19    Asking for or eliciting recall-simple                   .5247*
 20    Asking for or eliciting recall-complex                  0
 21    Asking for use or application-simple                    .1968
 22    Asking for use or application-complex                   .1991
 23    Expressing-values, opinion, feelings                    .7221
 24    Eliciting student expressions-values, opinions,
       feelings                                                .3010
 25    Listening to or observing non-teacher directed
       pupil activity                                          .6382*

Note:  Coefficients above .7 without an (*) are generalizable with two
       raters and three occasions.

    *  Generalizability of .7 was reached for some combination of 8
       or fewer raters and 8 or fewer occasions.

   NR  represents the situation when no responses were coded for
       this item.

   NV  represents the situation when a negative variance less than
       -.2 occurred for this item.



Table 7

Generalizability coefficients for two raters and three
occasions for logically comparable items for the Flanders,
CVC, CCO, and STARS observation systems.

Behavioral Category         Flanders        CVC             CCO             STARS
                            Item  Coef.     Item  Coef.     Item  Coef.     Item  Coef.

Approval of feelings          1   .0357      11   .1020      10   .7220       5   .4103

Praise and approval           2   .7686       9   .6101*     10   .7220       5   .4103
                                             10   .3454
                                             11   .1020
                                             12   .4160

Accepts or uses student's     3   .1690      12   .4160       2   .5839*      5   .4103
ideas

Asks questions                4   0           1   .2027       1   .1001       8   .6173*
                                              2   .3133       2   .5839*      9   0
                                              3   .6658*      3   .4601      10   0
                                              4   .4463                      19   .5247*
                                                                             20   0
                                                                             21   .1968
                                                                             22   .1991
                                                                             24   .3010

Gives information             5   .8593       5   0           5   .6403*     11   0
                                              6   .7188       8   .6610*     12   .5782
                                              7   .4093       9   .6956*     13   0
                                                                             14   .7106
                                                                             15   0
                                                                             16   0
                                                                             17   .4719
                                                                             23   .7221







Table 7 (cont.)

Behavioral Category         Flanders        CVC             CCO             STARS
                            Item  Coef.     Item  Coef.     Item  Coef.     Item  Coef.

Gives direction             [illegible]     [illegible]     [illegible]       6   .2204

Disapproval                 [illegible]      14   .5708*     12   .3621       2   0
                                             15   0                           3   0
                                             16   .5437*

No communication             10   0                          13   0           1   0
                                                                             25   .6382*

Informational feedback-                      13   [illegible]                 3   0
negative

Ignoring                                                     11   0           4   .1917

Listens/observes                                              4   .2498      25   .6382*

Note:  Coefficients above .7 without an (*) are generalizable with two
       raters and three occasions.

    *  Generalizability of .7 was reached for some combination of 8 or
       fewer raters and 8 or fewer occasions.






Measuring Classroom Interactions: How Many Occasions
Are Required to Measure Them Reliably?



Oded Erlich                          Gary Borich

Tel-Aviv University        and       The University of Texas at Austin



One important line of inquiry in research on teaching has operationally
defined teacher behavior variables and examined their relationships to 
student achievement. This approach assumes that the individual teacher plays 
a key role in producing student learning. However, results from correlational 
studies of teacher behaviors and student outcomes have been disappointing with 
most correlations low or nonreplicable (Borich, 1977; Shavelson and Atwood, 1977). 

Shavelson and Atwood (1977), in a recent review of current research on
teaching, hypothesized that one possible reason for the lack of relationships
between teacher behaviors and student achievement is that the generalizability
of behavioral measurements has not been adequately examined or established to
allow conclusions about relationships between teacher behavior and student
outcomes. Measures of teacher behavior contain potential sources
of error (facets) such as observation occasion, observers, and subject matter
which might affect their generalizability. The generalizability of measures
is not a function of these separate facets, but rather a function of the
simultaneous consideration of all the facets which might affect the
generalizability of the measures. The effect of these facets on the
generalizability of teacher behavior measures can be estimated using
generalizability theory (Cronbach, Gleser, Nanda, and Rajaratnam, 1972). In
generalizability theory a generalizability study (G study) is conducted which
has two purposes. The first is to examine the generalizability of teacher
behavior measures by considering the measurement facets (e.g., occasions and
raters) which affect the reliability of measurements obtained. Based on this
analysis, a G study then recommends variables for inclusion in future decision
studies (D studies) which examine relationships between teacher behaviors and
student outcomes.

Only a few studies on the generalizability of teacher behavior measures
have been reported. They have either explained how to apply generalizability
theory to examine the problems in measuring teacher behavior (e.g., Medley
and Mitzel, 1963; McGaw, Wardrop, and Bunda, 1972; Marzano, 1973; and Rowley,
1975), or they have failed to use appropriately the data available (e.g.,
Sandoval, 1974).

Since generalizability studies were not available, Shavelson and Atwood
explored the measurement problem by examining studies which differed in
measurement facets, but which used similar teacher variables. They attempted
to find patterns in the stability of teacher behavior measures across these
studies. Although they could not determine the amount of error contributed
by various facets in separate studies, they reached several tentative
conclusions about the stability of 13 clusters of teacher behavior variables.
Global ratings were found to be the most stable measures, while variable
clusters of teacher presentation, positive feedback, probing, and direct
control were found to be moderately stable. Unstable variable clusters
included teacher questions, negative feedback, nonprobing behaviors, indirect
control, and student-centered teaching style.

A recent generalizability study (G study) examined patterns of error
sources contributed by the facets of raters and occasions in order to identify
generalizable measures of teacher behavior (Erlich, 1976). Erlich analyzed
data collected by Sandoval (1974) which included frequency counts as well as
global ratings on five 5th grade teachers observed by two or three raters
while teaching reading and mathematics on three observation occasions. Erlich
defined a measure of a variable to be generalizable if it required a combination
of 4 or fewer raters and 10 or fewer occasions to reach a coefficient of
generalizability of 0.7. Variables were classified into three groups: (1) low
frequency of occurrence (ultimately excluded from analysis), (2) high frequency
variables whose measures appeared not to be generalizable, and (3) high frequency
variables whose measures appeared to be generalizable. The teacher behavior
variables found to be potentially generalizable in his analysis supported most
of the conclusions reached by Shavelson and Atwood concerning the stability of
these variables. The one exception occurred in the cluster of variables related
to teacher questions. Shavelson and Atwood found all questioning behaviors to
be unstable, but Erlich found that those questioning variables related to ways
of checking student reactions and learning were generalizable.
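A criterion of this kind, a coefficient of 0.7 reached with some combination of
at most 4 raters and 10 occasions, can be checked by searching over candidate
designs. The sketch below assumes a fully crossed teachers x raters x occasions
random-effects design and the standard relative-error coefficient for that
design; the actual design analyzed by Erlich may have differed. The variance
components in the example are invented, and the function names and dictionary
keys are hypothetical.

```python
from itertools import product

def g_coefficient(var, n_raters, n_occasions):
    """Relative generalizability coefficient for a crossed t x r x o design.

    var: dict with variance components 't' (teachers), 'tr' (teacher x rater),
         'to' (teacher x occasion) and 'tro_e' (residual).
    """
    if var["t"] <= 0:
        return 0.0
    error = (var["tr"] / n_raters
             + var["to"] / n_occasions
             + var["tro_e"] / (n_raters * n_occasions))
    return var["t"] / (var["t"] + error)

def smallest_design(var, target=0.7, max_raters=4, max_occasions=10):
    """Return (raters, occasions) needing the fewest observations, or None."""
    feasible = [
        (r * o, r, o)
        for r, o in product(range(1, max_raters + 1), range(1, max_occasions + 1))
        if g_coefficient(var, r, o) >= target
    ]
    if not feasible:
        return None  # the measure would be classed as not generalizable
    _, raters, occasions = min(feasible)
    return raters, occasions

# Hypothetical variance components for one teacher behavior variable.
example = {"t": 1.0, "tr": 0.1, "to": 0.5, "tro_e": 0.4}
print(smallest_design(example))
```

Ranking feasible designs by the product of raters and occasions is one possible
cost rule; a study in which raters are more expensive than occasions might
weight the two facets differently.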

The purpose of the present study was to examine the generalizability of
measures of classroom interaction occurring during 2nd and 3rd grade class
activities excluding reading instruction and to provide information concerning
the number of observation occasions required to reach a 0.7 level of
generalizability for each of these measures.



Method 

The data analyzed in this study were collected during the second year of
a two year replicated study of teacher effectiveness using the Brophy-Good
Teacher-Child Dyadic Interaction System (Brophy and Evertson, 1976). Subjects
were 28 teachers who had 5 or more years of teaching experience with their 3
most recent years of experience at the 2nd or 3rd grade level. These teachers
were selected because they had shown high consistency in producing student
learning gains on the Metropolitan Achievement Tests. They were observed
between 9 and 14 times during whole class activities and reading instruction by
two different raters who alternated across occasions.

Teachers were eliminated from our analysis if almost no data were
recorded for them on one or both of the two main categories of variables in
the Brophy-Good System (public response and private response variables) during
nonreading class activities. Also, although teachers were observed a total of
9-14 times, about half the occasions included only reading group instruction,
leaving 3-7 occasions in which data were collected on dyadic interactions
other than reading. Therefore, the sample was reduced, and the data analyzed
consisted of 17 teachers on 5 occasions for the category of public response
variables and 22 teachers on 5 occasions for the category of private response
variables.

The design selected for the analysis was a one-facet nested design, with
occasions nested within teachers. Occasions were considered to be nested
because teachers were observed at different times of day, on different days,
and teaching what may be considered different lessons.
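With occasions nested within teachers, the scores for one variable form a one-way
layout: each teacher has a separate set of occasions. A minimal sketch of how
variance components might be estimated for such a layout, assuming a balanced
design (the same number of occasions for every teacher) and the usual
expected-mean-square estimators from one-way random-effects analysis of
variance. The function name and the scores are hypothetical, and setting a
negative estimate to zero is a convention assumed here, not a step reported by
the authors.

```python
def nested_variance_components(scores):
    """Estimate variance components for occasions nested within teachers.

    scores: list of lists, one inner list of occasion scores per teacher.
    Returns (teacher variance, within-teacher error variance).
    """
    n_t = len(scores)
    n_o = len(scores[0]) if n_t else 0
    if n_t < 2 or n_o < 2 or any(len(row) != n_o for row in scores):
        raise ValueError("need a balanced layout with >= 2 teachers and occasions")

    teacher_means = [sum(row) / n_o for row in scores]
    grand_mean = sum(teacher_means) / n_t

    # Between-teacher and within-teacher mean squares.
    ms_between = n_o * sum((m - grand_mean) ** 2 for m in teacher_means) / (n_t - 1)
    ms_within = sum(
        (x - m) ** 2 for row, m in zip(scores, teacher_means) for x in row
    ) / (n_t * (n_o - 1))

    # Universe-score (teacher) variance; negative estimates set to zero.
    var_teacher = max((ms_between - ms_within) / n_o, 0.0)
    # Occasion, teacher x occasion, and residual error are confounded here.
    var_error = ms_within
    return var_teacher, var_error

# Hypothetical frequency counts: three teachers, two occasions each.
counts = [[4, 6], [10, 12], [1, 3]]
print(nested_variance_components(counts))
```

In a nested layout the occasion effect cannot be separated from the teacher by
occasion interaction, which is why a single error component is returned.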

Even though they are an implicit source of error, raters were not considered
as a potential source of error in this analysis for several reasons. First, all
raters had extensive training during the first year of the study and during the
summer prior to the second year of the study, enabling them to consistently reach
at least 0.8 agreement. Furthermore, the criteria for agreement required that
raters achieve a 0.8 reliability not only for their codes in each general category
of behavior in the observation system, but also for frequency counts on clusters
of variables within each category. Disagreements between raters were most often
a result of one rater being able to code more information than another, and,
therefore, the rank ordering of the teachers was not affected. This implies that
there was minimal teacher-rater interaction, and therefore, raters were considered
not to be a potential source of error affecting the generalizability of the
measures.


The Teacher-Child Dyadic Interaction observation instrument attempts to code
all dyadic interactions (teacher behaviors with respect to an individual child
as well as the child's response and interactions with the teacher) occurring
in the classroom. It contains 167 variables divided into two main categories:
public response variables, in which the teacher-child interaction occurs in a
group setting; and private response variables, in which the teacher and child
confer privately about the child's individual work. Within these two categories
of variables, Brophy and Good identified clusters of variables. The public
variables included the following clusters: Teacher's Method of Selecting
Students to Respond; Difficulty Level of Questions; Type of Questions Asked
(Academic or Nonacademic); Quality of Student Response to Questions; Teacher's
Feedback Reaction to Student Responses; Student Initiated Comments; and Student
Initiated Questions. The private interaction variables were divided into
three clusters: Child Created Contacts (CCC); Teacher Afforded Contacts (TAC);
and Behavior Related Contacts.

Generalizability theory was used as the statistical basis for the data
analysis. For each variable, the analysis provided the estimate of the universe
score (true score in classical test theory) variance [σ²(t)], and the estimate of
the error variance, which in this design was due to the teacher by occasion
interaction confounded with the occasion variance and unidentified sources of
error [σ²(o, to, e)]. Based on these variance components, the number of occasions
required to obtain a generalizability coefficient of 0.7 was calculated for each
variable. A generalizable variable was defined in this study as one for which
a coefficient of generalizability of 0.7 could be obtained by observing the
teacher on ten or fewer observation occasions. Not only is ten a practical
upper limit on the number of observation occasions which could be used, but also,
and of greater importance, teacher behaviors which require more than ten occasions
to obtain a reliable estimate are usually inconsistent and fluctuating, suggesting
a need to redefine and/or reconceptualize these variables.
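For a one-facet design the coefficient obtained by averaging over n occasions
can be written as σ²(t) / [σ²(t) + σ²(o, to, e) / n], so the smallest n that
reaches 0.7 satisfies n >= (0.7 / 0.3) × σ²(o, to, e) / σ²(t). The sketch below
applies that relation together with the ten-occasion limit. It is an
illustration under these assumptions, not the authors' program; the function
names and the variance values in the example are hypothetical.

```python
def occasions_required(var_t, var_error, target=0.7, search_limit=1000):
    """Smallest number of occasions whose mean reaches the target coefficient.

    var_t: universe-score variance estimate.
    var_error: error variance estimate (occasion, interaction, and residual).
    Returns None when the target cannot be reached within search_limit.
    """
    if var_t <= 0:
        return None  # no stable teacher differences to generalize about
    for n in range(1, search_limit + 1):
        if var_t / (var_t + var_error / n) >= target:
            return n
    return None

def is_generalizable(var_t, var_error, max_occasions=10):
    """Apply the ten-or-fewer-occasions criterion described in the text."""
    n = occasions_required(var_t, var_error)
    return n is not None and n <= max_occasions

# Hypothetical variance estimates for two variables.
print(occasions_required(1.0, 1.0), is_generalizable(1.0, 1.0))
print(occasions_required(1.0, 5.0), is_generalizable(1.0, 5.0))
```

Searching upward from one occasion avoids rounding a closed-form result, at the
cost of an arbitrary upper bound on the search.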



Results



Initial inspection of the data revealed that a majority of the variables
occurred infrequently. Two types of low frequency variables were identified.
The first type of infrequent variable consisted of variables for which the
frequencies were scattered throughout the teacher by occasion matrix, i.e.,
only a small number of teachers engaged in these dyadic interactions, and the
frequency counts for these interactions were uniformly low and were obtained
on fewer than three occasions for each teacher. This type of infrequent variable
was eliminated from the analysis since these frequency counts were too
inconsistent, too low, limited to too few teachers, and would have demanded
a very large number of occasions to reach an acceptable level of generalizability.
Two entire clusters of variables (Student-Initiated Questions and
Student-Initiated Comments) and one sub-cluster (Opinion Questions) were
completely eliminated by this process. Brophy and Evertson (1976) suggested in
their analysis that these types of interactions may be inappropriate for teaching
fundamental tool skills in the 2nd and 3rd grade. The other low frequency
variables of this type which were also eliminated were spread throughout the
remaining variable clusters. In general, these variables appeared to be
infrequent because of the detailed nature of the observation instrument, which
attempts to allow for all possible interactions even when their occurrence is not
probable (e.g., praise after a wrong answer or criticism after a right answer).

A second type of low frequency variable was retained for analysis. These
low frequency variables were recorded for relatively few teachers, but occurred
more consistently. These variables, although occurring infrequently, may be
generalizable, and, if such is the case, should be included in correlational
studies of teacher behaviors and student outcomes.

Table 1 presents the results of the analysis for the public response
variables. Variables are grouped into four clusters based on those
developed by Brophy and Good. Each variable cluster is discussed separately.
For each variable the table includes the estimates of universe score variance
[σ²(t)] and error variance [σ²(o, to, e)] and the number of occasions required
to reach a 0.7 level of generalizability.



INSERT TABLE 1 ABOUT HERE 



The first variable cluster, Teacher's Selection of Respondents, describes
the way in which the teacher selects students to respond to questions asked. The
teacher may either preselect (name the child who is to answer before asking the
question), select a child from among those who volunteer to answer, or select a
nonvolunteer. If a student gives the answer before the teacher has time to
select a student, this is labeled a "call-out." Relatively few occasions, three,
are needed to obtain a reliable measure of the frequency of call-outs. Teacher
"selection of a volunteer" and the "preselection of a student" to respond are
generalizable, but these variables require more occasions, five and eight
respectively, to reach a 0.7 level of generalizability. The last variable,
"selection of a nonvolunteer," requires twelve occasions and is nongeneralizable
if we use our earlier criterion of ten occasions as the practical upper limit
of the number of occasions that are possible.

The next cluster, Type of Questions, contains variables related to the
type of questions asked. "Product" and "process" questions represent difficulty
levels of academic questions. To answer a product question, the child must give
a specific correct answer which can be expressed in a single word or short
phrase. The process question, which is more complex, requires the child to
explain the steps which must be followed to solve a problem or reach a
conclusion. "Math questions" do not differentiate between the difficulty levels
of the questions, but include all questions related to math content. The last
two question variables, although both subject-matter related, are considered
nonacademic questions. These are called self-reference questions because they
are not intended to elicit a particular correct factual answer, but ask the
child instead about some factor in his personal background.

Three variables in this cluster were found to be generalizable. "Math
questions" and "subject-matter-related self-reference questions about the
student's experience" can both be estimated by the use of three occasions.
"Product questions," the type found to occur most frequently at this grade level,
require six occasions. The two remaining question variables, "process questions"
and "subject-matter-related self-reference questions asking a student's
preference," are nongeneralizable, requiring 17 and 48 occasions, respectively.

The third cluster, Quality of Student Response to Questions, evaluates
student answers to questions. Four variables were considered: "correct,"
"part-correct," "wrong," and "no response." The number of "correct" and "wrong"
answers can be estimated by using six occasions, and the number of "no responses"
by using ten occasions. However, measurement of the variable "part-correct
responses" requires twelve occasions, and is therefore nongeneralizable.

The last cluster involves public response opportunities and contains
variables of Teacher Feedback Reaction to Student Responses. Three types of
feedback which occur after a correct student response, "praise," "process
feedback" (explaining the process involved in reaching the correct answer), and
"asking a new question," are generalizable, requiring two, two, and four occasions
respectively. Teacher feedback which "affirms the answer" following a correct
response is nongeneralizable, requiring the use of 12 observation occasions.
The three remaining types of feedback occurred after either a wrong answer or
a no response. All are generalizable, with "asking another student" requiring
the use of [illegible] occasions, "rephrasing or cluing after a wrong answer"
requiring eight occasions, and "asks another student" requiring four occasions.



Table 2 presents the results of the analysis for the private dyadic
interaction variables. In these interactions, the teacher deals privately with
one child about matters idiosyncratic to the child. These interactions may be
"work-related" (giving or asking for help with class content or procedures),
"personal" (giving or asking for personal information or for favors), or
"behavior-related" (classroom behavior). The personal and work-related
interactions are divided into two clusters: "Child Created Contacts" (CCC) and
"Teacher Afforded Contacts" (TAC). The "Behavior-Related Contacts" (which are all
teacher afforded) are considered separately to correspond to the Brophy-Good
classification.



INSERT TABLE 2 ABOUT HERE 



The first variable cluster, "Child Created Contacts," contains more variables
than any other cluster. Not only do there appear to be many child created
dyadic interactions at the 2nd and 3rd grade level, but most of these interactions
are generalizable. Only three variables are nongeneralizable: "content-related
[illegible]" ([illegible] occasions), "content-related CCC giving long feedback"
(24 occasions), and "work-related procedural CCC which were delayed"
(15 occasions). Eight of the remaining eleven variables can be estimated reliably
by the use of four or fewer occasions, suggesting that many child created
contacts are reliable, consistent behaviors at the 2nd and 3rd grade levels.

The second variable cluster, "Teacher Afforded Contacts" (TAC), is composed
of private interactions initiated by the teacher. Most teacher afforded contacts
were work related. The total TAC "math contacts" can be estimated reliably by
the use of [illegible] observation occasions, while TAC "work contacts with long
feedback" and TAC "work contacts with brief feedback" require 5 and 7 occasions
respectively to reach a 0.7 coefficient of generalizability. TAC "work contacts
[illegible]" were also generalizable (6 occasions). However, TAC "work
contacts involving observation" require the use of 17 occasions and are
nongeneralizable. The other teacher afforded private interactions in this
cluster involved "procedural management contacts" or "personal contacts." One of
these required only 3 occasions and was generalizable, while the other was
nongeneralizable, needing 12 occasions.

pSfV ^°"»^^"-> contains 5 variables. The ■ 

first 3 describe types of teacher reactions to student misbehavior. "Teacher 
criticism" can be estimated reliably by the use of 3 occasions, while "teacher : 
'l?£?iHMland.2nsnv^^^^^ 

The last 2 variables assess the teacher's handling of behavior-related contacts. 
Contacts in whTch "no teach.r. errox" occurs can be estimated reliably by the 
use of only 2 occasions, and contacts involving "teacher overreaction" require 
7 occasions, " " ■ " • ■ • • ~ .v: . 

In summary, the findings indicate that many public and private variables
can be considered generalizable if measured by the required number of
observation occasions. On the other hand, many other variables obtained such
low frequency counts that they were excluded from analysis and are considered
to be nongeneralizable. The large number of infrequent teacher-child dyadic
interaction variables leaves open the possibility that dyadic interactions may
[illegible] at least at the primary level. Classroom observations at higher
grade levels might still show that some of the infrequent variables eliminated
from this analysis do occur more frequently and/or consistently at these
levels. If such is the case, these variables should be analyzed to determine
their generalizability within the framework of these higher grade levels.



Conclusions

This study shows that the generalizability of behavioral measurements
must be an important consideration in classroom observation research. This
analysis has identified generalizable measures of classroom interactions at
the 2nd and 3rd grade level and determined the number of observation occasions
required to reach a 0.7 level of generalizability. It should be recalled that
raters were not considered as an error source in this study because extensive
training of raters and the stringent criteria for a priori inter-rater
agreement ensured a high inter-rater reliability and a minimum of teacher-rater
interaction. When observers are trained to an appropriate level, as in the
Texas Teacher Effectiveness Study, raters can be eliminated as a source of
error. Otherwise, raters as well as occasions must be considered as potential
sources of error affecting the generalizability of the measures.



Past classroom observation studies have often used three or fewer
observation occasions to measure teacher behaviors. This study found that
many variables must be measured by more than three occasions to obtain a 0.7
level of generalizability. Future studies collecting classroom observational
data in the lower grade levels using the Brophy-Good system or a similar system
measuring these same behaviors should rely upon the findings of this study
regarding the number of observation occasions required to estimate the
measures reliably. It is apparent that generalizable classroom measures require
different numbers of occasions to be measured reliably. Therefore, it seems
appropriate to recommend that the generalizability of different measures
obtained by other observation systems be examined in order to determine how to
measure them reliably. Obtaining reliable measurements will enable researchers
to eliminate sources of measurement error which may be contributing to the lack
of relationships between classroom interactions and student outcomes.
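The cost of observing on too few occasions can be illustrated numerically. The sketch below assumes the one-facet coefficient of generalizability, universe score variance divided by the sum of universe score variance and error variance over n occasions, and uses the variance estimates printed for "Correct" student responses in Table 1 (117.94 and 289.69). The function name is illustrative and is not part of the study.

```python
def g_coefficient(var_teacher, var_error, n_occasions):
    """Coefficient of generalizability for a one-facet design
    (occasions nested within teachers), averaged over n_occasions."""
    return var_teacher / (var_teacher + var_error / n_occasions)


# Variance estimates as printed for "Correct" responses in Table 1.
VAR_T, VAR_E = 117.94, 289.69

for n in (1, 3, 6):
    print(n, round(g_coefficient(VAR_T, VAR_E, n), 2))
```

Under these assumptions, three occasions give a coefficient of about 0.55, and the 0.7 level is first reached at six occasions, the number listed for this variable.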



Table 1

Estimate of Universe Score Variance and Error Variance, and Number of
Occasions Required to Reach 0.7 Level of Generalizability
for Public Dyadic Interaction Variables

                                                                      Number of
                                              σ²(t)     σ²(o,to,e)    Occasions

TEACHER'S SELECTION OF RESPONDENTS
  Call-outs by student                         4.64         5.83          3
  Selects volunteer                           65.21       127.03     [illegible]
  Preselects student                          14.13        66.73          8
  Selects nonvolunteer                        73.07       237.96         12

TYPE OF QUESTION
 Academic Questions
  Total math response opportunities          175.23       241.59          3
  Product questions                          135.62       373.89          6
  Process questions                            0.54        11.03         48
 Nonacademic Questions
  Subject-matter-related self-reference
    question about the student's
    experience                            [illegible]       7.37          3
  Subject-matter-related self-reference
    question asking student's preference       2.27        16.94         17

QUALITY OF STUDENT RESPONSE TO QUESTIONS
  Correct                                    117.94       289.69          6
  Wrong                                        4.87        12.99          6
  No response                                  3.84        14.96          9
  Part-correct                                 0.79         3.72         12

[cluster heading illegible]
  Process feedback                             1.07         0.81          2
  Praise                                      16.72        16.50          2
  Asks new question                           11.70        22.73          4
  Affirms answer                               5.04        24.72         12
 Following Wrong Answer
  Asks another student                    [illegible]       2.63          5
  Rephrases or clues                           0.36         1.26          8
 Following No Response
  Asks another student                         2.23         4.24     [illegible]




Table 2

Estimate of Universe Score Variance and Error Variance, and Number of
Occasions Required to Reach 0.7 Level of Generalizability
for Private Dyadic Interaction Variables

                                                                        Number of
                                                 σ²(t)    σ²(o,to,e)    Occasions

CHILD CREATED CONTACTS
 Work-Related Interaction
  Content
   Total math child created work contacts    [illegible]  [illegible]  [illegible]
   Content-related CCC given brief feedback  [illegible]  [illegible]  [illegible]
   Content-related CCC [illegible]           [illegible]  [illegible]  [illegible]
   Content-related CCC given long feedback        1.28       12.94     [illegible]
  Procedural
   Work-related procedural CCC                   13.57       15.45          2
   Work-related procedural CCC with criticism     0.09        0.14          4
   Work-related procedural CCC with praise        0.04        0.12          8
   Work-related procedural CCC which were
     delayed                                      0.05        0.29         15
   Work-related procedural CCC with brief
     feedback                                    15.33       11.27          2
 Personal Interactions
   CCC personal experience sharing
     interactions                                 2.38        4.75     [illegible]
   CCC personal procedural interactions           6.41       12.08     [illegible]
   CCC personal procedural interactions
     which were granted                           2.04        6.31     [illegible]
   CCC personal procedural interactions
     not granted                                  0.98        1.38     [illegible]

TEACHER AFFORDED CONTACTS
   Total math teacher afforded work contacts [illegible]  [illegible]  [illegible]
   Work contact with long feedback           [illegible]  [illegible]  [illegible]
   Work contact with brief feedback          [illegible]  [illegible]  [illegible]
   Work contact involving observation        [illegible]  [illegible]  [illegible]
   Work contact involving criticism          [illegible]  [illegible]  [illegible]
   Procedural management contacts            [illegible]  [illegible]  [illegible]
   Personal contacts                         [illegible]  [illegible]  [illegible]

BEHAVIOR RELATED CONTACTS
   Teacher criticism                              4.40        5.57          3
   Teacher warnings                              16.77       27.94          4
   Nonverbal intervention                         0.52        2.07          9
   No teacher error                              45.32       43.38          2
   Teacher overreaction                           0.22        0.70          7

[The values 5.09, 9.12, and 48.37 are legible in the Content group but cannot
be matched to rows.]



References



Borich, G. D. Sources of invalidity in measuring classroom behavior.
Instructional Science, 1977, in press.

Brophy, J. E., & Evertson, C. M. Learning from teaching: A developmental
perspective. Boston: Allyn and Bacon, Inc., 1976.

Cronbach, L. J., Gleser, G. C., Nanda, H., & Rajaratnam, N. The dependability
of behavioral measurements: Theory of generalizability for scores
and profiles. New York: John Wiley and Sons, Inc., 1972.

Erlich, O. A study of the generalizability of measures [illegible].
Unpublished doctoral dissertation, [illegible], 1976.

[Author illegible]. [illegible] the reliability of [illegible] instructional
system observation instrument. Unpublished thesis, University of
Illinois, 1973.

McGaw, B., Wardrop, J. L., & Bunda, M. A. Classroom observation schemes:
Where are the errors? American Educational Research Journal, 1972,
9(1), 13-27.

Medley, D. M., & Mitzel, H. E. Measuring classroom behavior by systematic
observation. In N. L. Gage, Handbook of research on teaching. Chicago:
Rand McNally, 1963.

Rowley, G. L. The reliabilities of classroom observational [illegible]:
Estimation, interpretation, [illegible] application. Unpublished doctoral
dissertation, University of Toronto, 1975.

Sandoval, J. Beginning teacher evaluation study [illegible]:
Sub-study of consistency of teacher behavior. Princeton,
New Jersey and Berkeley, California: Educational Testing Service, 1974.

Shavelson, R., & Atwood, N. Generalizability of measures of teaching
process. In G. D. Borich, The appraisal of teaching: Concepts and
process. Reading, Massachusetts: Addison-Wesley, 1977.






Generalizability of Teacher Process Behaviors
during Reading Instruction



Oded Erlich                          Gary Borich

Tel-Aviv University        and       The University of Texas
                                     at Austin

Attempts to find correlations between reading instruction and reading
achievement have previously centered around methods of teaching reading (e.g.,
whole word vs. phonics) (Chall, 1967). While some tentative conclusions have
been drawn about the relative effectiveness of various methods, no one method
has been shown to be unquestionably superior. Another approach to
studying factors related to reading achievement is that of observing operationally
defined variables of teacher behavior and classroom interaction and then relating
them to reading achievement. This approach assumes that pupil-teacher classroom
interactions play a key role in producing pupil learning. By identifying
classroom interactions which increase pupil achievement, researchers can assist
teachers in constructing an empirically validated instructional model for the
teaching of reading.

Results from past correlational studies of teacher behaviors and student
outcomes (including, but not restricted to, reading achievement) have been
disappointing, with most correlations low or nonreplicable (Shavelson and Atwood,
1977). One possible reason for the lack of relationships between classroom
interactions and student achievement is that the generalizability of behavioral
measurements has not been adequately examined or established to allow conclusions
about relationships between teacher behavior and student outcomes to be drawn.
In this paper we will be concerned with the generalizability of classroom
interaction measures during reading instruction.

The Concept of Generalizability

The concept of generalizability is based on the notion that the behavior
observed represents only a sample of the true behavior. If the sample of
measurements contains little or no error, the generalization to the
characteristic behavior is sound; the accuracy of the measurement is
high. If the observed scores contain sizable error of measurement, the
generalization to the characteristic behavior is tenuous; the accuracy is low.
Measures of teacher-pupil classroom interaction contain potential sources of
error (facets) such as observation occasion, observers, subject matter, etc.
Only by considering the effect of all these facets can we determine the extent
to which teacher behavior measures are generalizable.

For example, in most studies of teaching process, a random sample of
teachers is observed by two or more raters. The consistency with which the
teachers are rank ordered on some variable such as "number of verbal
reinforcements" or "number of questions the teacher asks" is interpreted as the
reliability of the measurement. Typically each teacher's score is an average
of the raters' scores for that teacher and is usually interpreted as
characteristic of the teacher asking questions or using verbal reinforcements.
No doubt the use of several raters provides a more precise measure on each
teacher, but what about the nature of the pupils taught, the teaching situation,
the subject matter taught, and other factors that might contribute to the
instability of the teachers' behavior? While the measurement is taken in one
particular setting and at one particular point in time, it is usually interpreted
as generalizing over many settings at different points in time.

Only a few studies on the generalizability of teacher behavior measures
have reported on more than one facet. Most have either explained how
to apply generalizability theory to examine the problems in measuring teacher
process variables or they have failed to use appropriately the data available.
(See Erlich, 1976.) Two appropriate generalizability studies recently examined
variables of student-teacher classroom interaction. Erlich and Borich (1976)
analyzed classroom interactions during nonreading class activities in the 2nd
and 3rd grades. Erlich (1976) analyzed 5th grade teacher behaviors occurring
during reading and math combined. Because different subject matters, e.g.,
reading, math, social studies, may elicit different kinds and frequencies of
pupil-teacher classroom interactions, observation data of interactions occurring
during different subject matters should be analyzed separately.

Purpose

The purpose of this study was to identify teacher-pupil interactions
occurring during beginning reading instruction and to examine the
generalizability of these measures of classroom interaction.

Method 

Sample. The data analyzed in this study were collected during the second
year of a two year replicated study of teacher effectiveness using the
Brophy-Good Teacher-Child Dyadic Interaction System (Brophy and Evertson, 1976).
Subjects were 26 teachers who had 5 or more years of teaching experience with
their 3 most recent years of experience at the 2nd or 3rd grade level. These
teachers were selected because they had produced consistent pupil learning on
the Metropolitan Achievement Tests over three consecutive years.(1) Teachers
were observed between three and seven times during reading
instruction by two different raters who alternated across occasions. Four
teachers who had been observed on fewer than five occasions were eliminated
from our analysis. For those teachers who were observed on more than five
occasions, five occasions were selected at random for the analysis. Thus, the
final data analyzed included 22 teachers each observed on five occasions.

(1) A linear pattern of either gain, constancy, or decline over the three-year
period (Brophy, 1973) was the definition of consistent pupil learning in this
study.
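The balancing step described in the Sample paragraph can be sketched as follows. This is only an illustration under assumed data structures: the dictionary layout, the function name, and the fixed random seed are hypothetical and are not taken from the study.

```python
import random


def select_occasions(observations, k=5, seed=0):
    """Drop teachers with fewer than k occasions; otherwise keep k at random.

    `observations` maps a teacher id to a list of per-occasion records.
    """
    rng = random.Random(seed)  # fixed seed only so the sketch is repeatable
    selected = {}
    for teacher, occasions in observations.items():
        if len(occasions) < k:
            continue  # too few occasions: excluded from the analysis
        if len(occasions) == k:
            selected[teacher] = list(occasions)
        else:
            selected[teacher] = rng.sample(occasions, k)
    return selected


# Hypothetical data: each occasion record is just a label here.
demo = {"A": [1, 2, 3], "B": [1, 2, 3, 4, 5], "C": [1, 2, 3, 4, 5, 6, 7]}
balanced = select_occasions(demo)
print(sorted(balanced), [len(v) for v in balanced.values()])
```

The result is a balanced data set (every retained teacher has exactly five occasions), which is what the nested analysis requires.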

Design. The design selected for the analysis was a one facet nested
design, occasions being nested within teachers. Occasions were considered
to be nested because teachers were observed at different times of day, on
different days, and teaching what may be considered different lessons.
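For a balanced design with occasions nested within teachers, the two variance components can be estimated from a one-way analysis of variance. The sketch below uses the standard random-effects estimators; the paper does not describe its own computations, so the function name, the data layout, and the truncation of negative estimates at zero are assumptions made for illustration.

```python
from statistics import mean


def variance_components(scores):
    """Estimate universe score and error variance from a balanced design.

    `scores[i][j]` is the score of teacher i on that teacher's j-th occasion.
    Returns (var_teacher, var_error).
    """
    n_teachers = len(scores)
    n_occ = len(scores[0])
    if any(len(row) != n_occ for row in scores):
        raise ValueError("design must be balanced")
    teacher_means = [mean(row) for row in scores]
    grand_mean = mean(teacher_means)
    # Mean square between teachers and mean square within teachers.
    ms_between = (n_occ * sum((m - grand_mean) ** 2 for m in teacher_means)
                  / (n_teachers - 1))
    ms_within = (sum((x - m) ** 2
                     for row, m in zip(scores, teacher_means) for x in row)
                 / (n_teachers * (n_occ - 1)))
    var_error = ms_within
    # Negative estimates are set to zero (a common convention, assumed here).
    var_teacher = max((ms_between - ms_within) / n_occ, 0.0)
    return var_teacher, var_error


# Hypothetical frequency counts: 3 teachers, 3 occasions each.
print(variance_components([[1, 2, 3], [4, 5, 6], [7, 8, 9]]))
```

Because occasions are nested, the within-teacher mean square cannot separate the occasion effect from the teacher-occasion interaction and residual error; all three are pooled in the single error component.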

Even though an implicit source of error, raters were not considered as a
potential source of error in this analysis for the following reasons. The
raters had extensive training during the first year of the study and during
the summer prior to the second year of the study, enabling them to consistently
reach a 0.8 agreement. Furthermore, the criteria for agreement required
that raters achieve the 0.8 reliability not only in their coding for each
category in the observation system, but also on frequency counts within each
category. Disagreements between raters were most often a result of one rater
being able to code more information than another, and, therefore, the rank
ordering of the teachers was not affected. This implies that there was also a
minimal teacher-rater interaction; and therefore, raters were considered not
to be a potential source of error affecting the generalizability of the measures.

Instrument. The instrument used to collect data was the Teacher-Child
Dyadic Interaction Observation System (Brophy and Good, 1969). This instrument
attempts to code all dyadic interactions (teacher behaviors with respect to an
individual child as well as the child's response and interactions with the
teacher) occurring in the classroom. It contains 167 variables divided into
two main categories: public response variables, in which the teacher-child
interaction occurs in a group setting; and private response variables, in which
the teacher and child confer privately about the child's individual work.

Within these two categories of variables, Brophy and Good identified clusters
of variables. The public variables included the following clusters: Teacher's
Method of Selecting Students to Respond; Difficulty Level of Questions; Type
of Questions Asked (Academic or Nonacademic); Quality of Student Response to
Questions; Teacher's Feedback Reaction to Student Responses; Student Initiated
Comments; and Student Initiated Questions. The private interaction variables
were divided into three clusters: Child Created Contacts (CCC); Teacher Afforded
Contacts (TAC); and Behavior Related Contacts.

Statistical Analysis. The effect of the occasion facet on the generalizability
of teacher-child interactions was estimated by the application of generalizability
theory (Cronbach, Gleser, Nanda, and Rajaratnam, 1972). In generalizability
theory a generalizability study (G study) has two purposes. The first is to
examine the generalizability of the measures (e.g., of teacher behavior) by
considering the potential sources of error (e.g., occasions and raters) which
affect the reliability of measurements obtained. Based on this analysis, a
G study then recommends variables for inclusion in future decision studies
(D studies) which examine, for example, relationships between teacher behaviors
and student outcomes.

For each variable examined in this study, the G study analysis provided
the estimate of the universe score (true score in classical theory) variance
[σ²(t)], and the estimate of the error variance, which in this design was due
to the teacher-occasion interaction confounded with the occasion variance and
unidentified sources of error [σ²(o,to,e)]. The formula for obtaining the
coefficient of generalizability in this design is

$$\frac{\sigma^2(t)}{\sigma^2(t) + \sigma^2(o,to,e)/n}$$

where n is the number of occasions. Using this formula and based on the
estimates of the variance components, the number of occasions (n) required to
obtain a prespecified level of generalizability can be calculated for each
variable.



A generalizable variable was defined in this study as one for which a
coefficient of generalizability of 0.7 could be obtained by observing the teacher
on ten or fewer observation occasions. Not only is ten a practical upper limit
on the number of observation occasions which could be used, but also, and of greater
importance, teacher behaviors which require more than ten occasions to obtain a
reliable estimate are usually inconsistent and fluctuating, suggesting a need
to redefine and/or reconceptualize these variables.
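Setting the coefficient equal to a target level g and solving for n gives n = [g / (1 - g)] × σ²(o,to,e) / σ²(t). The sketch below applies this to two rows of Table 1 as printed ("Process questions" and "Choice questions"). The function names are illustrative. Treating the real-valued n directly against the limit of ten is an assumption, since the paper does not state how fractional values were rounded.

```python
def g_coefficient(var_teacher, var_error, n_occasions):
    """Coefficient of generalizability for n_occasions (one-facet design)."""
    return var_teacher / (var_teacher + var_error / n_occasions)


def occasions_needed(var_teacher, var_error, target=0.7):
    """Real-valued number of occasions at which the coefficient equals target."""
    return (target / (1.0 - target)) * var_error / var_teacher


def is_generalizable(var_teacher, var_error, target=0.7, max_occasions=10):
    """True if the target level is reachable within max_occasions occasions."""
    return occasions_needed(var_teacher, var_error, target) <= max_occasions


# Variance estimates as printed in Table 1 of this paper.
rows = {
    "Process questions": (2.42, 16.64),
    "Choice questions": (162.45, 266.64),
}
for name, (var_t, var_e) in rows.items():
    n = occasions_needed(var_t, var_e)
    print(name, round(n, 1), is_generalizable(var_t, var_e))
```

With these inputs the sketch gives roughly 16.0 occasions for process questions (nongeneralizable under the ten-occasion limit) and roughly 3.8 for choice questions, in line with the 16 and 4 occasions listed in the table.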



Results 



Initial inspection of the data revealed that a majority of the variables
occurred infrequently and inconsistently, and were recorded for only a few
teachers. This pattern of occurrence was characteristic of all variables in
three clusters (Student-Initiated Questions, Student-Initiated Comments, and
Child Created Contacts) and two sub-clusters (Opinion Questions and
Non-Academic Self Reference Questions). Brophy and Evertson (1976) suggested in
their analysis that the classroom interactions represented by these variables
may not be appropriate for teaching fundamental tool skills such as reading and
math in the 2nd and 3rd grades. The rest of the low frequency variables were
scattered throughout the remaining variable clusters. They appeared to be
infrequent mainly because of the detailed nature of the observation instrument,
which attempts to allow for all possible interactions even when their
occurrence is not likely (e.g., praise after a wrong answer or criticism after
a right answer). None of the low frequency variables described above appeared
to play any appreciable role in primary reading instruction in the classrooms
observed, and they were, therefore, eliminated from the generalizability
analysis.

Another type of low frequency variable was retained for analysis. These
variables differed from those previously described in that the behaviors occurred
for at least 20% of the teachers. These variables may be important in
distinguishing between effective and ineffective teachers despite their
relatively infrequent occurrence across teachers, and their generalizability
should be examined. Those found to be generalizable should be included in
correlational studies of teacher-pupil classroom interaction and student
outcomes to determine if they are, in fact, important variables in reading
instruction.

Table 1 presents the results of the analysis for the classroom interaction
variables analyzed. Variables are grouped into five clusters based on those
developed by Brophy and Good (1969). The first four clusters contain public
interactions, and the last cluster contains private interactions. Each
variable cluster is discussed separately. For each variable the table includes
the estimates of universe score variance [σ²(t)] and error variance [σ²(o,to,e)]
and the number of occasions required to reach a 0.7 level of generalizability.



INSERT TABLE 1 ABOUT HERE 



The first variable cluster, Teacher's Selection of Respondents, describes
the way in which the teacher selects students to respond to questions asked.
The teacher may either preselect (name the child who is to answer before asking
the question), select a child from among those who volunteer to answer, or
select a nonvolunteer. If a student gives the answer before the teacher has
time to select a student, this is labeled a "call-out." Relatively few
occasions are needed to obtain a reliable (generalizable) measure of the
selection of a volunteer, or a nonvolunteer, or of the frequency of call-outs
(2, 3, and 4 respectively). The last variable, "preselection of a student," is
also generalizable, but requires more occasions (9) to reach a 0.7 level of
generalizability.



The next cluster, Type of Question, contains variables related to the
type of questions asked. "Choice questions," "product questions," and "process
questions" represent difficulty levels of academic questions. To answer a
choice question, the child must select the correct answer from two or more
options given by the teacher. To answer a product question, the child must
give a specific correct answer which can be expressed in a single word or
short phrase. The process question, which is the most complex, requires the
child to explain the steps which must be followed to solve a problem or to
reach a conclusion. Two of the three variables in this cluster were found to
be generalizable. "Product questions" and "choice questions," the types
found to occur most frequently in reading instruction at these grade levels,
require four and five occasions respectively to reach a 0.7 level of
generalizability. "Process questions" is nongeneralizable, requiring 16
occasions to reach the acceptable level of generalizability.

The third cluster, Quality of Student Response to Questions, evaluates
student answers to questions. Four variables were considered: "correct,"
"part-correct," "wrong," and "no response." All can be estimated by three or
fewer occasions, indicating that for these variables the behaviors are
highly consistent within a particular reading instruction group.

Only one variable in the Teacher Feedback Reaction to Student Responses
cluster, praise following a correct answer, occurred frequently enough to
warrant analysis. Apparently, this is the only type of feedback which occurs
regularly during reading instruction. It needs only three observation
occasions to obtain a 0.7 level of generalizability.

The last cluster, Teacher Afforded Contacts (TAC), contains private dyadic
interactions. TACs may be related to work, to procedures, or to a child's
behavior. Only a few variables in this cluster were analyzed because most
behaviors occurred infrequently. The measures of TAC variables related to
work and to management procedures were both nongeneralizable. These teachers'
behaviors, although occurring frequently, fluctuated so greatly that 13 and 18
occasions would be needed to obtain a reliable estimate of their behavior. On
the other hand, measures of interactions related to a child's behavior were
quite consistent. All measures of behavior-related contacts are generalizable,
with the number of occasions required to reach a 0.7 level of generalizability
ranging from 3 to 5.

Discussion

The findings above indicate that a majority of the variables
analyzed can be considered as generalizable if measured by the required number
of observation occasions. It should be recalled, however, that all other
Dyadic Interaction System variables not presented in the table exhibited such
low frequency counts that they were excluded from analysis. Although some
of these might be found generalizable, this generalizability statistically
could result from the fact that their frequency of occurrence tends to be
consistently zero.

The large number of infrequent teacher-child dyadic interaction variables
suggests that primary reading instruction consists of a limited range of such
behaviors. These findings, however, do not exclude the possibility that some
classroom interaction variables during reading instruction at higher grade
levels might be more frequent and/or consistent at these levels. If such
is the case, these variables should be analyzed to determine their generalizability.

Ten observation occasions were selected as the maximum number allowed to
reach a 0.7 level of generalizability in this study. The number of occasions
required to reach this level for those variables which were generalizable ranged
from 1 to 9 occasions. Past classroom observation studies, considering a range of
subject matters and grade levels, have often used three or fewer occasions to
measure teacher behaviors (Shavelson and Atwood, 1977). The present analysis
indicates that some variables require more than three occasions to be measured
reliably. It should be noted, however, that in this study interactions occurring
frequently during reading instruction may, in general, be considered highly
consistent. Almost half of the generalizable variables could be measured reliably
by the use of three observation occasions and approximately three quarters of
them by the use of five observation occasions.
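The proportions quoted above can be checked against the "Number of Occasions" column of Table 1. The sketch below tallies the 17 analyzed variables as they appear in the table; the dictionary and variable names are illustrative only.

```python
# Number of occasions listed in Table 1 for each analyzed variable.
occasions = {
    "Selects volunteer": 2,
    "Selects nonvolunteer": 3,
    "Call-outs by student": 4,
    "Preselects student": 9,
    "Choice questions": 4,
    "Product questions": 5,
    "Process questions": 16,
    "Part-correct": 1,
    "Correct": 2,
    "Wrong": 3,
    "No response": 3,
    "Praise following correct answer": 3,
    "Work contact involving brief contact": 13,
    "Procedural management contacts": 18,
    "Contacts involving no teacher error": 3,
    "Contacts involving teacher warning": 4,
    "Contacts involving teacher criticism": 5,
}

LIMIT = 10  # maximum number of occasions allowed for a generalizable variable
generalizable = {k: n for k, n in occasions.items() if n <= LIMIT}
within_three = sum(n <= 3 for n in occasions.values())
within_five = sum(n <= 5 for n in occasions.values())

print(len(occasions), len(generalizable), within_three, within_five)
```

On this tally, 14 of the 17 variables are generalizable, 8 need three or fewer occasions, and 13 need five or fewer. The fractions 8/17 and 13/17 appear to be the closest match to "almost half" and "approximately three quarters," which suggests the proportions may have been computed over all analyzed variables; this reading is an inference, not something the paper states.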

Classroom observation studies have frequently observed teachers teaching
different subject matters, but combined different subject matters for analysis.
The Teacher Effectiveness Study (Brophy and Evertson, 1976) coded the reading
data separately, allowing reading and non-reading class activities to be
analyzed separately. A comparison of the results of this study with those of
Erlich and Borich (1976), who analyzed the generalizability of the non-reading
activities, indicates that classroom interactions during reading and
non-reading instruction differ in several significant ways.

Reading instruction appears to be primarily a public process. With the
exception of behavior-related contacts, almost all of the private interaction
variables occurred infrequently. Non-reading class activities appeared balanced
between public and private interactions and included many more private
teacher-child interactions (both teacher afforded and child created). For example,
in Erlich and Borich's analysis, the cluster of child created contacts contained
the largest number of variables analyzed. In this study, the entire cluster
was eliminated because so few instances of child created contacts during
reading instruction were recorded.

Teachers also asked different types of questions in reading and non-reading
instruction. During non-reading activities, almost all questions asked were
"product questions." "Choice questions" appeared so infrequently that this
variable was not even analyzed. During reading instruction, however, choice
questions occurred frequently and were highly generalizable (four occasions).
Teachers appeared to find choice questions particularly suited to reading
instruction, but not to other subjects. Teacher questions were more task
oriented during reading instruction. Self-reference questions were asked
during non-reading activities, but only academic questions occurred during
reading instruction.

Teacher behaviors appeared influenced by the reading context in several
other important ways. For example, selection of a nonvolunteer during
non-reading activities was inconsistent and its measurement nongeneralizable,
while the same behavior was highly consistent and its measurement generalizable
during reading instruction. The more consistent selection of nonvolunteers
during reading suggests that the teacher is more likely to insist upon involving
the reluctant, shy, or non-assertive child during reading than during non-reading
activities. Another noteworthy difference occurred in the quality of student
responses to questions. The percentage of correct, wrong, part-correct, and
no-response answers could be estimated in three or fewer occasions during
reading instruction, while the number of occasions required during non-reading
activities was six or greater. This difference suggests that the teacher is
more consistent in gauging the difficulty level of questions during reading
instruction than during other activities. A final difference was that feedback
type reactions were far more limited during reading instruction than during
non-reading instruction. Only one feedback response, praise after a correct
response, was employed frequently enough during reading instruction to be
considered for analysis.

In summary, the findings of this study suggest that observation data for
reading instruction should be analyzed separately from data obtained during
other types of instruction. Behaviors observed during, say, math or social
studies may not occur during reading, and conversely, reading instruction may
elicit behaviors unique to that context. This study found that reading
instruction encompassed a narrower range of pupil-teacher classroom interaction
than that found during non-reading instruction in the same classrooms. Even
when the same behaviors occurred across subject matters, measures of these
behaviors may be generalizable in one context and not in the other, or the
number of occasions necessary to reach an acceptable level of generalizability
may differ. In planning future observational studies of reading instruction,
researchers should rely upon the findings of this study to ascertain the
appropriate number of observations needed to obtain generalizable measures of
teaching behavior during reading instruction.
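
The link between variance estimates and the "number of occasions" can be written out explicitly. The following is a sketch that assumes the standard generalizability coefficient for a teacher's mean over n occasions; the authors' exact formula is not reproduced in this section, so the expression is an assumption. Here \sigma^2(t) is the universe score (teacher) variance and \sigma^2(o,to,e) is the single-occasion error variance, the two quantities reported in Table 1.

```latex
\[
\rho^2(n) \;=\; \frac{\sigma^2(t)}{\sigma^2(t) + \sigma^2(o,to,e)/n},
\qquad
\rho^2(n) \ge 0.7
\;\Longleftrightarrow\;
n \;\ge\; \frac{0.7}{0.3}\cdot\frac{\sigma^2(o,to,e)}{\sigma^2(t)} .
\]
% Worked example with the Table 1 values for process questions:
\[
n \;\ge\; \frac{0.7}{0.3}\cdot\frac{16.64}{2.42} \;\approx\; 16.0 .
\]
```

Under this assumption, a behavior with a large error variance relative to its universe score variance needs many occasions, which is why a behavior can be generalizable in one context and not in another.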







Table 1

Estimates of Universe Score Variance and Error Variance,
and Number of Occasions Required to Reach the 0.7 Level of Generalizability
for Dyadic Interaction Variables during Reading Instruction

σ²(t) = estimated universe score variance; σ²(o,to,e) = estimated error variance

                                                                    Number of
Variable                                   σ²(t)     σ²(o,to,e)    Occasions
----------------------------------------------------------------------------
Teacher's Selection of Respondents
  Selects volunteer                       105.83       161.93           2
  Selects nonvolunteer                    258.09       381.72           3
  Call-outs by student                     10.86        19.49           4
  Preselects student                       14.68        59.74           9

Type of Question
  Choice questions                        162.45       266.64           4
  Product questions                       273.78       608.93           5
  Process questions                         2.42        16.64          16

Quality of Student Response to Questions
  Part-correct                              5.69         3.15           1
  Correct                                 384.2?       342.09           2
  Wrong                                    19.09        21.09           3
  No response                               6.96        10.43           3

Teacher Feedback Reaction to Student Responses
  Praise following correct answer          35.34        41.34

Teacher Afforded Contacts
  Work contact involving brief contact      5.41        30.45          13
  Procedural management contacts            5.55        42.80          18
  Behavioral related contacts
    Contacts involving no teacher error     8.45        11.08           3
    Contacts involving teacher warning      4.80         7.87           4
    Contacts involving teacher criticism    0.97         2.22           5
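
The occasion counts in Table 1 can be explored with a short calculation. The sketch below is illustrative only: it assumes the standard generalizability coefficient for a mean over n occasions, and the function names `g_coefficient` and `occasions_needed` are hypothetical, not taken from the paper. The variance values are those tabled for the Type of Question variables.

```python
def g_coefficient(universe_var: float, error_var: float, n_occasions: float) -> float:
    """Generalizability coefficient for a mean over n_occasions observations.

    Assumed form: universe_var / (universe_var + error_var / n_occasions).
    """
    if universe_var <= 0 or error_var < 0 or n_occasions <= 0:
        raise ValueError("variances must be non-negative and n_occasions positive")
    return universe_var / (universe_var + error_var / n_occasions)


def occasions_needed(universe_var: float, error_var: float, target: float = 0.7) -> float:
    """Unrounded number of occasions at which the coefficient equals `target`."""
    if not 0 < target < 1:
        raise ValueError("target must lie strictly between 0 and 1")
    if universe_var <= 0 or error_var < 0:
        raise ValueError("universe_var must be positive and error_var non-negative")
    return (target / (1 - target)) * (error_var / universe_var)


# (universe score variance, error variance) as tabled for Type of Question
rows = {
    "Choice questions": (162.45, 266.64),
    "Product questions": (273.78, 608.93),
    "Process questions": (2.42, 16.64),
}

for name, (u_var, e_var) in rows.items():
    n = occasions_needed(u_var, e_var)
    # The coefficient evaluated at the unrounded n equals the target by construction.
    print(f"{name}: n = {n:.2f}, coefficient = {g_coefficient(u_var, e_var, n):.2f}")
```

The function returns the unrounded value because the rounding convention behind the tabled counts is not stated in this section. Rounding up guarantees the target is met; rounding to the nearest whole number may leave the coefficient slightly below it.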












